Preface



On behalf of the Program Committee, it is our pleasure to present to you the 
proceedings of the 7th Symposium on Recent Advances in Intrusion Detection 
(RAID 2004), which took place in Sophia-Antipolis, French Riviera, France,
September 15-17, 2004. 

The symposium brought together leading researchers and practitioners from 
academia, government and industry to discuss intrusion detection from research 
as well as commercial perspectives. We also encouraged discussions that ad- 
dressed issues that arise when studying intrusion detection, including informa- 
tion gathering and monitoring, from a wider perspective. Thus, we had sessions 
on detection of worms and viruses, attack analysis, and practical experience 
reports. 

The RAID 2004 Program Committee received 118 paper submissions from 
all over the world. All submissions were carefully reviewed by several members 
of the Program Committee and selection was made on the basis of scientific 
novelty, importance to the field, and technical quality. Final selection took place 
at a meeting held May 24 in Paris, France. Fourteen papers and two practical 
experience reports were selected for presentation and publication in the confer- 
ence proceedings. In addition, a number of papers describing work in progress 
were selected for presentation at the symposium. The keynote address was given 
by Bruce Schneier of Counterpane Systems. Hakan Kvarnstrom of TeliaSonera 
gave an invited talk on the topic “Fighting Fraud in Telecom Environments.” 

A successful symposium is the result of the joint effort of many people. In 
particular, we would like to thank all authors who submitted papers, whether 
accepted or not. Our thanks also go to the Program Committee members and 
additional reviewers for their hard work with the large number of submissions. In 
addition, we want to thank the General Chair, Refik Molva, for handling confer- 
ence arrangements, Magnus Almgren for preparing the conference proceedings, 
Marc Dacier for finding support from our sponsors, Yves Roudier for maintaining 
the conference Web site, and Herve Debar at France Telecom R&D for arranging 
the Program Committee meeting. Finally, we extend our thanks to the sponsors: 
SAP, France Telecom, and Conseil Regional Provence Alpes Cote d’Azur.

Abstract. One of the most dangerous cybersecurity threats is control
hijacking attacks, which hijack the control of a victim application, and 
execute arbitrary system calls assuming the identity of the victim pro- 
gram’s effective user. System call monitoring has been touted as an effec- 
tive defense against control hijacking attacks because it could prevent re- 
mote attackers from inflicting damage upon a victim system even if they 
can successfully compromise certain applications running on the system. 
However, the Achilles’ heel of the system call monitoring approach is
the construction of an accurate system call behavior model that minimizes
false positives and negatives. This paper describes the design, implementation,
and evaluation of a Program semantics-Aware Intrusion Detection
system called Paid, which automatically derives an application-specific
system call behavior model from the application’s source code,
and checks the application’s run-time system call pattern against this 
model to thwart any control hijacking attacks. The per-application be- 
havior model is in the form of the sites and ordering of system calls made 
in the application, as well as its partial control flow. Experiments on a 
fully working Paid prototype show that Paid can indeed stop attacks that 
exploit non-standard security holes, such as format string attacks that 
modify function pointers, and that the run-time latency and through- 
put penalty of Paid are under 11.66% and 10.44%, respectively, for a 
set of production-mode network server applications including Apache, 
Sendmail, Ftp daemon, etc. 

Keywords: intrusion detection, system call graph, sandboxing, mimicry 
attack, non-deterministic finite state automaton 



1 Introduction 

Many computer security vulnerabilities arise from software bugs. One particular 
class of bugs allows remote attackers to hijack the control of victim programs 
and inflict damage upon victim machines. These control hijacking exploits are 
considered among the most dangerous cybersecurity threats because remote at- 
tackers can unilaterally mount an attack without requiring any special set-up or 
any actions on the part of victim users (unlike email attachment or web page 
download). Moreover, many production-mode network applications appear to be 
rife with software defects that expose such vulnerabilities. For example, in the
most recent quarterly CERT Advisory summary (03/2003) [4], seven out of ten
vulnerabilities can lead to control hijacking attacks. As another example, the no- 
torious SQL Slammer worm also relies on control hijacking attacks to duplicate 
and propagate itself epidemically across the net. 

An effective way to defeat control-hijacking attacks is application-based 
anomaly intrusion detection. An application-based anomaly intrusion detection 
system closely monitors the activities of a process. If any activity deviates from 
the predefined acceptable behavior model, the system terminates the process or 
flags the activity as intrusion. The most common way to model the acceptable 
behavior of an application is to use system calls made by the application. The 
underlying assumption of the system call-based intrusion detection is that re- 
mote attackers can damage a victim system only by making malicious system 
calls once they hijack a victim application. Given that system calls are the only
means to inflict damage, it follows logically that by closely monitoring the system
calls made by a network application at run time, it is possible to detect and
prevent malicious system calls that attackers issue, and thus protect a computer
system from attackers even if some of its network applications have been compromised.
While the mechanics of system call-based anomaly intrusion detection
are well understood, successful application of this technology requires an accurate
system call model that minimizes false positives and negatives.

Wagner and Dean [22] first introduced the idea of using a compiler to derive
a call graph that can capture the system call ordering of an application.
At run time, any system call that does not follow the statically derived order
is considered an act of intrusion and thus should be prohibited. A call
graph derived from a program’s control flow graph (CFG) is a non-deterministic
finite-state automaton (NFA) due to such control constructs as if-then-else and
function call/return. The degree of non-determinism determines the extent to
which mimicry attacks [23] are possible, through so-called impossible paths [22].
This paper describes the design, implementation, and evaluation of a Program 
semantics-Aware Intrusion Detection system called Paid, which consists of a 
compiler that can derive a deterministic finite-state automaton (DFA) model 
which captures the system call sites, system call ordering, and partial control 
flow from an application’s source code, and an in-kernel run-time verifier that 
compares an application’s run-time system call pattern against its statically de- 
rived system call model, even in the presence of function pointers, signals, and 
setjmp/longjmp calls. Paid features several unique techniques: 

— Paid inlines each system call site in the program with its associated system 
call stub so that each system call is uniquely labeled by the return address 
of its corresponding int 0x80 instruction, 

— Paid inlines each call in the application call graph to a function having 
multiple call sites with the function’s call graph, thus eliminating the non- 
determinism associated with the exit point of such functions, 

— Paid introduces a notify system call that its compiler component can use to 
inform its run-time verifier component of information that cannot be determined
statically, such as function pointers and signal delivery, and to eliminate
whatever non-determinism cannot be resolved through system call inlining
and graph inlining, and

— Paid inserts random null system calls (which are also part of the system
call graph) at compile time and performs a run-time stack integrity check to
prevent attackers from mounting mimicry attacks.

The combination of these techniques enables Paid to derive an accurate DFA 
system call graph model from the source code of application programs, which 
in turn minimizes the run-time checking overhead. However, the current Paid
prototype has one drawback: it does not perform system call argument analysis.
We plan to include this feature in the next version of Paid.



2 Related Work 

2.1 System Call-Based Sandboxing 

Many recent anomaly detection systems [22, 8, 18, 13, 9, 24, 15, 17] define a normal
behavior model using run-time application activities. Although such systems cannot
stop all attacks, they can effectively detect and stop many control hijacking
attacks. Among these systems, the system call pattern has become the most popular
choice for modeling application behavior. However, simply keeping track
of system calls may not be sufficient because it cannot capture other program
information such as user-level application states.

Wagner and Dean’s work [22] advocated a compiler approach to derive three
different system call models: the callgraph model (NFA), the abstract stack or pushdown
automaton model (PDA), and the digraph model. Among all three models,
the PDA model, which models the stack operations to eliminate the impossible
paths, is the most precise model, but it is also the most expensive model in
terms of time and space in many cases. Paid's DFA model represents a significant
advance over their work. First, Paid uses the notify system call, system call
inlining, and graph inlining to reduce the degree of non-determinism in the input
programs. Second, Paid uses stack integrity check and random insertion of null
system calls to greatly raise the barrier for mimicry attacks. Third, Paid is more
efficient than Wagner and Dean’s system in run-time checking overhead. For example,
for a single transaction, their PDA model took 42 minutes for qpopper
and more than 1 hour for sendmail, whereas Paid only takes 0.040679 seconds
for qpopper and 0.047133 seconds for sendmail.

Giffin et al. [9] extended Wagner’s work to application binaries for secure
remote execution. They used null system calls to eliminate impossible paths in
their NFA model by placing a null system call after a function call. Paid
differs from this work because it places a null system call only where non-determinism
cannot be resolved through graph inlining and system call stub
inlining. As a result, Paid can use the DFA model to implement a simple and
efficient run-time verifier inside the kernel. Giffin et al. also tried graph inlining,
which they called automaton inlining. They found that graph inlining increases the
state space dramatically, whereas Paid's implementation on Linux increases
the state space by only around 100%. This discrepancy is due to the libc library on
Solaris. For example, a single socket call needs only a single edge or
transition on Linux, while it takes more than 100 edges on Solaris. They found
numerous other library functions that share the same problem. Giffin’s PDA
model is similar to Wagner’s model, and they used a bounded stack to solve the
infinite stack problem. However, when the stack is full, the PDA model eventually
becomes a less precise NFA model. Giffin et al. also proposed a Dyck model [10]
to solve the non-determinism problem by placing a null system call before and after
a function call to simulate stack operations. To reduce performance overhead, a
null system call from a function call does not actually trap to the kernel if the
function call itself does not make a system call.

Behavior blocking is a variation of system call-based intrusion detection. 
Behavior blocking systems run applications in a sandbox. All sandboxed ap- 
plications can only have the privileges specified by the sandbox. Even if an 
application is hijacked, it cannot use more privileges than specified. Existing
behavior blocking systems include MAPbox [1], WindBox [3], Janus [11], and
Consh [2]. The key issue of behavior blocking systems is to define an accurate
sandboxing policy, which is what Paid is designed for.

Systems such as StackGuard [6], StackShield [21] and RAD [5,16] tried to 
protect the return addresses on the stack, which are common targets of buffer 
overflow attacks. Non-executable stack [19] prevents applications from executing 
any code on the stack. A problem with these approaches is that they cannot prevent attacks
that target function pointers. IBM’s GCC extension [7] reorders local variables
and places all pointer variables at lower addresses than buffers. This technique 
offers some protection against buffer overflow attacks, but not buffer underflow 
attacks. Purify [12] instruments binaries to check each memory access at run 
time. However, the performance degradation and the increased memory usage 
are the key issues that prevent Purify from being used in production mode. 
Kiriansky [14] checks every branch instruction to ensure that no illegal code can 
be executed. 



3 Program Semantics-Aware Intrusion Detection

3.1 Overview 

Paid includes a compiler that automatically derives a system call site flow graph
or SCSFG from an application’s source code, and a DFA-based run-time verifier
that checks the legitimacy of each system call. To be efficient, the run-time
verifier of Paid is embedded in the kernel to avoid the context-switching overhead
associated with user-level system call monitors. The in-kernel verifier has to be
simple so that it itself does not introduce additional vulnerabilities. It also needs
to be fast so as to reduce the performance overhead visible to applications.
Finally, it should not consume much kernel memory. The key challenge
of Paid is to minimize the degree of non-determinism in the derived SCSFG
such that the number of checks that the run-time verifier needs to perform is
minimized.

Once an application’s SCSFG is known, the attacker can also use this information
to mount a mimicry attack, in which the attacker follows a legitimate
path through the application’s SCSFG until she reaches a system call she needs
to deal the fatal blow. For example, assume an application has a buffer overflow
vulnerability and the system call sequence following the vulnerability point
is {open, setreuid, write, close, exec}, and an attacker needs setreuid
and exec for her attack. After the attacker hijacks the application’s control using
a buffer overflow attack, she can mimic the legitimate system call sequence by
properly interleaving calls to open, write and close with those to setreuid and exec,
thus successfully fooling any intrusion detection system that checks
only system call ordering. To address mimicry attacks, Paid applies two simple
techniques: stack integrity check and random insertion of null system calls. In
the next version of Paid, we will add a comprehensive checking mechanism on
system call arguments as well.
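
The mimicry example above can be illustrated with a small sketch. It assumes a hypothetical monitor, ordering_only_accepts, that compares only the names of the observed system calls against the legitimate sequence from the text; neither the function nor its interface comes from Paid. Because the attacker replays exactly the legitimate order, such a monitor has nothing to object to.

```c
#include <stddef.h>
#include <string.h>

/* Legitimate system call sequence following the vulnerability point,
 * taken from the example in the text. */
static const char *const legit_seq[] = {
    "open", "setreuid", "write", "close", "exec"
};
#define LEGIT_LEN (sizeof legit_seq / sizeof legit_seq[0])

/* Hypothetical ordering-only monitor (an illustration, not Paid's verifier).
 * It accepts a trace when every call matches the next expected call name.
 * It knows nothing about call sites, the stack, or arguments.
 * Returns 1 if the trace is accepted and 0 otherwise. */
static int ordering_only_accepts(const char *const *trace, size_t n)
{
    if (n > LEGIT_LEN)
        return 0;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(trace[i], legit_seq[i]) != 0)
            return 0;
    }
    return 1;
}
```

A trace issued by injected code in the order open, setreuid, write, close, exec is accepted by this check, which is why call-site labels and stack checks are needed in addition to ordering.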




Fig. 1. Graph interlinking and graph inlining are two alternatives for constructing a
whole-program system call graph from the system call graphs of individual functions



3.2 From NFA to DFA

The simplest way to construct a call graph for an application is to extract a 
local call graph for each function from the function’s CFG, and then construct 
the final application call graph by linking per-function local call graphs using 
either graph interlinking or graph inlining, which are illustrated in Figure 1. A 
local call graph or an application call graph is naturally an NFA because of 
such control constructs as if-then-else and function call/return. To remove non- 
determinism, we employ the following techniques: 1) system call stub inlining, 
2) graph inlining, and 3) insertion of notify system call. 

One source of non-determinism is due to functions that have many call sites. 
For these functions, the number of out-going edges of the final state of their local 
call graph is more than one, as exemplified by the function foo in Figure 1(B). 
To eliminate this type of non-determinism, we use graph inlining as illustrated in 
Figure 1(C). In the application call graph, each call to a function with multiple
call sites points to a unique duplicate of the function’s call graph, thus ensuring
that the final state of each such duplicated call graph has a single out-going
edge. Graph inlining can significantly increase the state space if not applied
carefully. We use an ε-transition removal algorithm to remove all non-system-call
edges from a function’s CFG before merging the per-function call graphs.

Another source of non-determinism is due to control transfer constructs, such
as for loop, while loop and if-then-else. One example of such a problem is
shown in Figure 1(C), where the program’s control can go to two different states
from the first state, at which an open system call is made. The reason for this
non-determinism is that on a Linux system, system calls are made indirectly
through system call stubs. Therefore, it is not possible to differentiate the open
system call made in the then branch from that in the else branch. We address
this problem by uniquely identifying each system call so that when a system call
is made, the run-time checker knows which call site issued it. More concretely, we inline every
system call with its associated system call stub so that a system call can be
uniquely identified with its return address.
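
As a rough sketch of why unique return addresses make the model deterministic, the following code keys each transition on the pair (current state, return address). The struct layout, the function name dfa_step, and any addresses used with it are assumptions made for illustration; the paper does not describe the verifier's data structures at this level.

```c
#include <stddef.h>
#include <stdint.h>

/* One edge of a deterministic system call graph: from state `from`, a system
 * call whose return address is `ret_addr` moves the automaton to state `to`.
 * (Hypothetical layout, not Paid's actual in-kernel representation.) */
struct edge {
    int       from;
    uintptr_t ret_addr;
    int       to;
};

/* Returns the next state, or -1 when no edge matches, in which case a
 * verifier would flag the system call as an intrusion. */
static int dfa_step(const struct edge *edges, size_t n,
                    int state, uintptr_t ret_addr)
{
    for (size_t i = 0; i < n; i++) {
        if (edges[i].from == state && edges[i].ret_addr == ret_addr)
            return edges[i].to;
    }
    return -1;
}
```

With two write call sites at different (made-up) addresses, the same system call number leads to different successor states, so no state ever has two candidate successors for one observed call.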

System call stub inlining does not completely solve the non-determinism
problem due to flow control constructs, since it does not inline normal functions. 
An example of such non-determinism is shown as follows: 



a()
{
    open();
}

b()
{
    a();
    read();
}

c()
{
    a();
    close();
}

main()
{
    if (true)
        b();
    else
        c();
}



Since the functions a, b, and c are not inlined, the then branch and the else
branch of main eventually lead to the same open system call even though the open
system call stub is inlined. As another example, graph inlining cannot completely
eliminate all non-determinism for recursive functions with multiple call sites
either. Paid introduces a new system call called notify to resolve whatever
non-determinism is left after graph inlining and system call stub inlining are
applied. The Paid compiler compiles an application in two passes. The first
pass generates an NFA, and Paid visits every state of the NFA to detect
non-determinism. When there is non-determinism, it traces back to the initial
point that leads to the non-determinism and marks it accordingly. In the second
pass the compiler inserts a notify call at each marked point to remove the
corresponding non-determinism, and finally generates a DFA. Giffin et al. [9, 10]
used a similar idea to eliminate non-determinism due to multi-caller functions.
However, blindly inserting notify calls can incur high performance overhead.
Paid only inserts notify calls when non-determinism cannot be resolved by
system call stub inlining and automaton inlining.
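
The detection step of the first pass can be sketched as a search for a state with two outgoing edges that carry the same label but lead to different states. This is a minimal illustration under assumed names (nfa_edge, find_nondeterministic_state) and integer labels; it does not reproduce the Paid compiler's pass, and it omits the trace-back to the point where notify is inserted.

```c
#include <stddef.h>

/* An NFA edge: in state `from`, observing `label` may lead to state `to`.
 * Labels are plain integers here (an assumption for this sketch); in Paid a
 * label would correspond to a system call site. */
struct nfa_edge {
    int from;
    int label;
    int to;
};

/* Returns a state that has two outgoing edges with the same label but
 * different targets, or -1 if the automaton is already deterministic. */
static int find_nondeterministic_state(const struct nfa_edge *e, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (e[i].from == e[j].from &&
                e[i].label == e[j].label &&
                e[i].to != e[j].to)
                return e[i].from;
        }
    }
    return -1;
}
```

For the a/b/c program above, the open call reached from main's initial state leads either toward read or toward close, so the initial state is reported. Giving each branch its own distinct notify edge before the shared open removes the conflict.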

With system call stub inlining, graph inlining and notify system call insertion,
the final call graph generated by the Paid compiler is actually a DFA, which
we refer to as the System Call Site Flow Graph or SCSFG. In addition to system
call ordering, the SCSFG also captures the exact location where each system call
is made since all system call stubs are inlined. Also, the final SCSFG does not
contain any non-system-call nodes, which reduces the state space dramatically since
most of the function calls do not contain any system calls.



3.3 System Call Inlining 

In Linux, system calls are made through system call stubs, and the actual trap
instruction that transfers control to the kernel is int $0x80 in the Intel x86
architecture. Paid inlines each system call site with the associated stub, and the
address of the instruction following int $0x80 becomes the unique label for
the system call site. Figure 2(B) shows the result of inlining system call sites.
LABEL_foo_0001 is the call site of the write system call in the then branch,
and LABEL_foo_0003 is the call site of the write system call in the else branch.
Figure 2(C) shows the SCSFG for the function foo, which includes a unique
transition edge for each of the write system calls.



foo(int a)
{
    if (a) {
        write();
        exec();
    }
    else {
        write();
    }
}

(A) Original Code

foo(int a)
{
    if (a) {
        ...                 /* setup arguments */
        movl $4, %eax
        int $0x80           /* write */
    LABEL_foo_0001:
        ...                 /* handle error number */
        movl $11, %eax
        int $0x80           /* exec */
    LABEL_foo_0002:
    }
    else {
        ...                 /* setup arguments */
        movl $4, %eax
        int $0x80           /* write */
    LABEL_foo_0003:
        ...                 /* handle error number */
    }
}

(B) After System Call Stub Inlining

(C) SCSFG



Fig. 2. After system call site inlining, the two write system calls can be distinguished
based on their unique labels, as indicated by LABEL_foo_0001 and LABEL_foo_0003 in the
result of inlining (B). Accordingly, the SCSFG in (C) has two different write transition
edges



System call site inlining is implemented via GCC front-end’s function inlining 
mechanism. However, to exploit this mechanism, we need to rewrite all system 
call stubs so that they are suitable for inlining. Rewriting system call stubs turns 
out to be a time-consuming task because of various idiosyncrasies in the stubs. 
Some system call stubs such as open, read, and write are actually generated 
through scripts. Other system call stubs such as the exec family, the socket 
family, and clone need to be modified by hand so that they conform to the call 
stub convention used in GLIBC. Finally, the LIBIO library, which replaces the
old standard I/O library in the new version of GLIBC, turns out to consume
most of the rewriting effort. Although Paid does not inline normal functions, it
chooses to inline fopen and fwrite because they are important to system security.
However, because fopen and fwrite use other functions in the LIBIO library
and the actual open and write system calls are made through function pointers,
we are forced to modify the whole LIBIO library so that function pointers are
eliminated and the resulting functions are suitable for inlining.

To force GCC to inline a function automatically, the function has to be
declared as always_inline, and the function has to be parsed before its callers.
Therefore, all rewritten system call stubs are declared as always_inline, and are
put in a header file called syscall_inline.h. The only modification to GCC for
system call stub inlining is to make sure that syscall_inline.h is always
loaded and parsed first before other header files and the original source file.



foo()
{
    write();
}

foo1()
{
    write();
    foo();
}

main()
{
    if ()
        foo();
    else
        foo1();
}

foo SCSFG    foo1 SCSFG    main SCSFG




Fig. 3. Paid's compiler builds an SCSFG for each function, and the SCSFG for main
represents the SCSFG for the entire program because of graph inlining



3.4 Building System Call Site Flow Graph 

Paid needs two passes to compile an application. The two passes are very similar
except that during the first pass, Paid analyzes the resulting SCSFGs to find
out where to insert notify calls to eliminate non-determinism. The second pass
builds the final SCSFGs into a DFA model. Paid’s compiler generates a call site
flow graph (CSFG) for each function, which contains normal function calls and
system calls in the function. Then starting with main’s CSFG, Paid’s linker uses
a depth-first search algorithm to build an SCSFG for each function in a bottom-up
fashion. For example, if function foo1 calls function foo2, foo2’s SCSFG is
constructed first, and is then duplicated and inlined in the SCSFG of foo1. If the
linker detects a recursive call chain, it inlines every function in the call chain and
assigns the resulting graph as the call graph of the first function in the chain.
For example, for the recursive call chain (a->b->c->a), the linker inlines the
CSFG of c and b at their respective call sites and assigns the expanded graph
to a. The result of inlining turns a call chain into a self-recursive call. After an
SCSFG is generated, the linker uses a simple ε-transition removal algorithm to
walk through the whole graph and remove all non-system-call nodes. In the end,
each function’s SCSFG contains only system call nodes, an entry node, and an
exit node. Because Paid inlines each function call graph, the SCSFG of main
is the SCSFG for the whole program, which the run-time verifier uses to check
each incoming system call. Finally, the linker stores all SCSFGs that contain
at least one system call node in an .scsfg section of the final ELF executable
binary. Figure 3 shows a short program with its SCSFGs. It is still necessary to
store the SCSFGs for individual functions because they may be needed when the
main SCSFG is amended at run time, as described in the following subsection.
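
The removal of non-system-call nodes described above can be approximated by bypassing each such node: every predecessor is connected directly to every successor, and the node is then detached. The sketch below uses a small adjacency matrix and the hypothetical names MAX_NODES and bypass_non_syscall_nodes; it is a stand-in for the linker's algorithm, whose details the text does not give.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 16   /* arbitrary bound chosen for this sketch */

/* adj[u][v] is true when there is an edge u -> v.
 * keep[v] is true for nodes that must survive: system call nodes and the
 * entry and exit nodes. All other nodes (ordinary function calls) are
 * bypassed and detached. */
static void bypass_non_syscall_nodes(bool adj[MAX_NODES][MAX_NODES],
                                     const bool keep[MAX_NODES], size_t n)
{
    for (size_t v = 0; v < n; v++) {
        if (keep[v])
            continue;
        /* Connect every predecessor of v directly to every successor of v. */
        for (size_t u = 0; u < n; u++) {
            if (u == v || !adj[u][v])
                continue;
            for (size_t w = 0; w < n; w++) {
                if (w != v && adj[v][w])
                    adj[u][w] = true;
            }
        }
        /* Detach v from the graph. */
        for (size_t k = 0; k < n; k++) {
            adj[k][v] = false;
            adj[v][k] = false;
        }
    }
}
```

For a chain entry -> call -> write -> exit, only the edges entry -> write and write -> exit remain afterwards, matching the statement that the final graph holds only system call, entry, and exit nodes.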

Many GLIBC functions are written in assembly language. We have to manually
generate the CSFGs for those functions. The technique we use to generate
CSFGs for functions written in assembly language is to compose dummy skeleton
C functions that preserve the call sequence of these functions, and then use the
Paid compiler to compile the skeleton C functions. The CSFGs for the skeleton
C functions are by construction the CSFGs for the assembly-code functions.



void foo()
{
    write();
}

main()
{
    void (*ptr)() = foo;
    ptr();
}

(A) Original Code

void foo()
{
    write();
pc0:
}

main()
{
    void (*ptr)() = foo;
    notify(1, foo);
pc1:
    notify(2, ptr);
pc2:
    ptr();
    notify(3, NULL);
pc3:
}

(B) After Notify Inserted

(C) Before Notify Call (main SCSFG and foo SCSFG)

(D) After Notify Call (main SCSFG and foo SCSFG)


Fig. 4. This demonstrates how a notify system call associated with a function pointer 
amends an application’s SCSFG at run time. (C) shows the SCSFGs before notify 
is called. (D) shows how notify links the main SCSFG and the SCSFG of the target
function, foo 



3.5 Run-Time Notification 

The original design of Paid assumes that the control flow graph of a program 
can be completely determined at compile time. This assumption is not valid 
because of branches whose target addresses cannot be determined statically, 
signals, unconventional control transfers such as setjmp/longjmp, etc. To solve
this problem, Paid introduces a special run-time notification mechanism based
on a special system call, notify(int flag, unsigned address), which informs
the run-time verifier of where the program control currently is at the time when 
the notify system call is made. Paid uses the notify system call for different 
purposes, each of which corresponds to a distinct value in the flag field. With 
the help of notify, Paid's run-time verifier can synchronize with the monitored 
application across control transfers that cannot be determined at compile time. 
Because notify is a system call and is thus also subject to the same sandboxing 
control as other system calls, attackers cannot issue arbitrary notify system 
calls and modify the victim application’s system call pattern. 

Instead of applying pointer analysis, Paid inserts a notify system call before
every function call that uses a function pointer. The actual value of the function
pointer variable, or the entry point of the target function, is used as an argument
to the notify system call, as shown by the second notify call in Figure 4(B).
When a notify system call that Paid’s compiler inserts because of a function
pointer traps into the kernel, and the target function’s SCSFG has not been
linked to the main SCSFG at the current execution point, the run-time verifier
searches for the SCSFG of the target function and dynamically links the matched
SCSFG to the main SCSFG at the current execution point, as demonstrated by
Figure 4(D). Because any function could potentially be the target of an indirect
function call, Paid needs to store the SCSFGs for all functions. If a function
is called from multiple call sites through a function pointer simultaneously, the
main SCSFG may become an NFA. To avoid this, Paid also inserts a notify
system call after every function call that uses a function pointer to inform the
run-time verifier about which return path it should take. However, an attacker
can overflow a function pointer so that it points to a desired existing function, mounting
an attack similar to a return-to-libc attack. To reduce the attack possibilities,
Paid also inserts a notify call at each location where a function’s entry point is
assigned to a function pointer, as shown by the first notify call in Figure
4(B). The notify call informs the run-time verifier to mark the function, and the
run-time verifier never links any unmarked function’s SCSFG to the main SCSFG.
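
The marking rule can be sketched as follows. The flag values 1, 2, and 3 mirror the arguments that appear in Figure 4(B), but their interpretation here (assignment, before an indirect call, after an indirect call), the struct verifier type, and handle_notify are assumptions for illustration rather than Paid's kernel code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MARKED 32   /* arbitrary bound chosen for this sketch */

/* Assumed meaning of the notify flag values used in Figure 4(B). */
enum {
    NOTIFY_ASSIGN = 1,  /* a function's entry point is assigned to a pointer */
    NOTIFY_CALL   = 2,  /* an indirect call through the pointer is about to happen */
    NOTIFY_RETURN = 3   /* the indirect call has returned */
};

struct verifier {
    uintptr_t marked[MAX_MARKED];  /* entry points seen in assignments */
    size_t    n_marked;
};

static bool is_marked(const struct verifier *v, uintptr_t fn)
{
    for (size_t i = 0; i < v->n_marked; i++) {
        if (v->marked[i] == fn)
            return true;
    }
    return false;
}

/* Returns true if the notify call is acceptable, false if it should be
 * treated as a violation. */
static bool handle_notify(struct verifier *v, int flag, uintptr_t addr)
{
    switch (flag) {
    case NOTIFY_ASSIGN:
        if (is_marked(v, addr))
            return true;
        if (v->n_marked == MAX_MARKED)
            return false;
        v->marked[v->n_marked++] = addr;
        return true;
    case NOTIFY_CALL:
        /* Only a marked function's graph may be linked into the main graph. */
        return is_marked(v, addr);
    case NOTIFY_RETURN:
        return true;
    default:
        return false;
    }
}
```

Under this sketch, an overwritten function pointer that targets a function never assigned to any pointer is rejected at the notify issued before the indirect call.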

Setjmp and longjmp functions are for non-local control transfers. To handle
setjmp/longjmp calls correctly, the Paid compiler inserts a notify call before
each setjmp call, using the address of the jmp_buf object as an argument. The
jmp_buf data structure is modified to include a pointer to the notify call in the
SCSFG. The added pointer is also used to pass the corresponding setjmp return
address to a notify call. When a notify system call due to setjmp is made,
the run-time verifier first retrieves the setjmp return address from the jmp_buf
and stores it in the current notify node, and then the verifier stores the current
location, which is the address of the current notify node, in the jmp_buf object.
Paid's compiler also inserts a notify system call before a longjmp call, using
the address of the jmp_buf as an argument. Upon a notify call due to longjmp,
the run-time verifier retrieves the previous location from the jmp_buf object and
links this location to the current location if the saved setjmp return address
matches the longjmp destination address stored in the jmp_buf.

Signals are handled differently. For non-blocking signals, the kernel either
ignores the signal, executes the default handler or invokes the user-supplied
signal handler. In the first two cases no user code is executed and so no modifications
are needed. However, if the application provides a signal handler, say
handle_signal, the run-time verifier first creates a new system call node for the
sigreturn system call, which is pushed on the user stack by the kernel, and
links the new node to the final node of the SCSFG of handle_signal. Then the
verifier saves the current SCSFG pointer in the new node, and changes the current
SCSFG pointer to the entry node of the SCSFG of handle_signal. Finally,
handle_signal is executed. When handle_signal returns to the kernel
through the sigreturn system call, the run-time verifier restores the current
SCSFG pointer from the one saved in the node corresponding to the sigreturn
system call, and proceeds as normal.

3.6 Stack Integrity Check 

System call stub inlining and sandboxing based on system call sites/ordering
constraints force attack code to jump to the actual system call sites in the
program in order to issue a system call. However, when control is transferred
to a system call site, it is not always possible for the attack code to grab
control back. To further strengthen the protection, Paid's run-time verifier also
checks the stack integrity by ensuring that all the return addresses in the stack
are proper when a system call is made. If any return address is outside the
original text segment, this indicates that a control-hijacking attack may have occurred,
because Paid explicitly forces all code regions to be read-only so that no
attack code can be inserted into these regions. This simple stack integrity check
greatly reduces the room that mimicry attacks have for maneuver, as most such
attacks need to make more than one system call. Although GCC uses the stack for
function trampolines for nested functions, this does not affect the stack integrity
check since function trampoline code does not make function calls. Also, before
executing a signal handler, Linux puts the sigreturn system call on the user
stack and points the signal handler’s return address to the sigreturn system
call code in the stack. However, Paid can easily detect such return addresses by
examining the code on the stack since the sigreturn invocation code is always
the same.
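The stack integrity check described above can be sketched roughly as follows. This is a minimal illustration and not Paid's kernel code: the function name stack_is_intact, the segment bounds, and the is_sigreturn_stub callback are all hypothetical, and the sketch assumes the saved return addresses have already been collected by walking the stack frames.

```python
def stack_is_intact(return_addresses, text_start, text_end,
                    is_sigreturn_stub=lambda addr: False):
    """Return True if every saved return address looks legitimate."""
    for addr in return_addresses:
        if text_start <= addr < text_end:
            continue      # points into the read-only text segment
        if is_sigreturn_stub(addr):
            continue      # kernel-placed sigreturn code on the user stack
        return False      # anything else suggests a hijacked return address
    return True


# Made-up segment bounds, used only for this illustration.
TEXT_START, TEXT_END = 0x08048000, 0x08060000

clean = stack_is_intact([0x08048ABC, 0x0804C120], TEXT_START, TEXT_END)
hijacked = stack_is_intact([0x08048ABC, 0xBFFFF3A0], TEXT_START, TEXT_END)
```

Here clean is True and hijacked is False, because the second chain contains an address that lies outside the assumed text segment.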

3.7 Random Insertion of Null System Calls 

To further improve the detection strength of Paid, Paid randomly inserts some
null system calls into an application. A null system call is a notify system call
that does not perform any operation. Paid randomly chooses some functions that
do not lead to any system call between two consecutive system calls in a function
call sequence, and inserts a random notify call into them. For example, for the call sequence
{write, buf, a, b, c, exec}, Paid may insert a notify call in the function
a so that the new call sequence would be {write, buf, a, notify, b, c,
exec}. Assuming that a buffer overflow happens in the function buf, the attack
code cannot call exec directly since notify is before exec in the sequence, and
the notify call is also included in the SCSFG. If the attack code wants to
call exec, it has to set up the stack to regain control after it makes a notify
call. However, since the verifier checks the stack when the notify call is made, it
detects the illegal return address on the stack and terminates the process. To
make remote attacks more difficult, one may be willing to recompile server
applications and hide the binaries from remote attackers. Since notify calls
are inserted randomly, it is highly unlikely that attack code can guess their
existence and their call sites. Inserting null system calls gives the run-time
verifier more observation points to monitor an application. Together with the stack
integrity check, it also forces attack code to follow more closely the application’s
original control flow.

To randomly insert null system calls, the first compilation pass analyzes the
initial version of the SCSFGs and determines where to insert null system calls,
and the second pass builds the final SCSFGs with random notify calls inserted.
The exact algorithm for inserting null system calls in the current Paid prototype
works as follows. For each two system calls A and B in an application SCSFG,
if there exists a path from A to B on which there is no other system call, Paid
randomly chooses some functions that do not lead to any system call on the
path and inserts null system calls into them. If the number of functions on a path is 2 to 4,
the compiler randomly inserts a null system call into one of the functions. If the
number of function calls is 5 to 7, it randomly inserts 2 null system calls; if the
number is 8 and up, it inserts 3. Many paths may go through the same function;
however, Paid never inserts more than one null system call in a function, and
it always makes sure that a null system call is inserted outside any loop.
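The insertion rule above can be sketched as follows. This is only an illustration under stated assumptions: the names null_call_quota and choose_sites are hypothetical, a path is modeled as a plain list of function names that do not lead to any system call, paths with fewer than two functions are assumed to receive no null call (the text does not cover that case), and the loop-avoidance step is omitted.

```python
import random


def null_call_quota(num_functions):
    """Number of null system calls for a path, following the thresholds in the text."""
    if num_functions >= 8:
        return 3
    if num_functions >= 5:
        return 2
    if num_functions >= 2:
        return 1
    return 0  # assumption: the text does not specify shorter paths


def choose_sites(path_functions, instrumented, rng):
    """Pick functions on one path to receive a null system call.

    `instrumented` is the set of functions that already hold a null call;
    it is updated in place so that no function is ever chosen twice.
    """
    quota = null_call_quota(len(path_functions))
    candidates = [f for f in path_functions if f not in instrumented]
    picked = rng.sample(candidates, min(quota, len(candidates)))
    instrumented.update(picked)
    return picked


rng = random.Random(0)          # fixed seed only to make the sketch repeatable
already = set()
first = choose_sites(["a", "b", "c"], already, rng)
second = choose_sites(["a", "b", "c", "d", "e"], already, rng)
```

Because the shared set is updated after each path, a function that lies on several paths receives at most one null call, which mirrors the constraint stated in the text.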

Potentially an attacker can examine the binary to deduce the locations of
these randomly inserted notify system calls. In practice, such attacks are
unlikely because it is well known that perfect disassembly is not possible on the x86
architecture [16], due to the fact that distinguishing data and code is
fundamentally undecidable, and Paid can randomly insert some dummy data into the
instruction/text segment as an anti-disassembler technique.



3.8 Run-Time Verifier 

Linux’s binary loader is modified to load the .scsfg segment of an ELF binary
into a randomly chosen region of its address space. That is, although the verifier 
performs system call-based sandboxing inside the kernel, the SCSFGs used in 
sandboxing reside in the user space, rather than in the kernel address space. 
To prevent attackers from accessing SCSFGs, the region at which the SCSFGs 
are stored is chosen randomly at load time. In addition, the SCSFG region is 
marked as read-only, so that it is not possible for an attacker to modify them 
without making system calls, which will be rejected because Paid checks every 
system call. When the run-time verifier needs to amend the main SCSFG due
to a notify system call, it has to make the SCSFG region writable, and turn it
back to read-only after the amendment. However, these operations are performed
inside the kernel, and thus may override the region’s read-only marking.

An execution pointer is added into Linux’s task_struct to keep track of 
a process’s current progress within its associated SCSFG. For each incoming 
system call, the verifier compares the system call number and the call site with 
each child of the node pointed to by the execution pointer. If a match is found, 
the execution pointer is moved to the matched node; otherwise, it indicates that 
an illegal system call is made, and the verifier simply terminates the process. 
Because a program’s main SCSFG is a DFA, only a single execution pointer 
is needed and the SCSFG graph traversal implementation is extremely simple, 
only 45 lines of C code. Accordingly, the performance overhead of the run-time 
verifier is very small, as demonstrated in Section 5. 
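The traversal described above can be illustrated with a small sketch. It is not the 45-line C implementation mentioned in the text; the class names Node and Verifier are hypothetical, and a dictionary lookup stands in for the comparison against each child of the current node.

```python
class Node:
    """One SCSFG state; children are keyed by (system call number, call site)."""

    def __init__(self):
        self.children = {}


class Verifier:
    def __init__(self, entry):
        self.execution_pointer = entry

    def on_syscall(self, number, call_site):
        """Advance on a legal system call; return False on an illegal one."""
        nxt = self.execution_pointer.children.get((number, call_site))
        if nxt is None:
            return False          # the real verifier would terminate the process
        self.execution_pointer = nxt
        return True


# Tiny example graph: write at site 0x100 followed by exec at site 0x200.
# The system call numbers and call sites are made up for illustration.
entry, after_write, after_exec = Node(), Node(), Node()
entry.children[(4, 0x100)] = after_write
after_write.children[(11, 0x200)] = after_exec

verifier = Verifier(entry)
legal = [verifier.on_syscall(4, 0x100), verifier.on_syscall(11, 0x200)]
illegal = Verifier(entry).on_syscall(11, 0x200)   # exec without the prior write
```

Because each (number, call site) pair leads to at most one successor, a single execution pointer is enough, which is the property the text attributes to the SCSFG being a DFA.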



3.9 Support for Dynamically Linked Libraries 

Paid treats dynamically linked libraries as statically linked libraries. It statically
builds the SCSFGs for each DLL used by an application, and inlines the DLL
SCSFGs into the main SCSFG of the application at static linking time.
However, for each DLL, there is a table to hold all call site addresses in the .scsfg
segment. If a DLL is loaded at a different address than the preferred address, the
loader will fix all its call site addresses in the associated table at load time. For
libraries loaded by dlopen at run time, the pre-built SCSFGs of these libraries
are copied to the application’s own address space after the libraries are loaded.
All library function calls via function pointers that are obtained by dlsym also
rely on notify system calls, with an additional burden to fix call site addresses
if a library is not loaded in the preferred location. The disadvantage of statically
inlining the SCSFGs of the DLLs is that the resulting binaries cannot work with 
different versions of the same libraries. 
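The load-time fix-up of call site addresses mentioned above amounts to rebasing a table by the difference between the actual and preferred load addresses. The following is a minimal sketch of that arithmetic; the function name relocate_call_sites and the example addresses are hypothetical and do not come from the Paid loader.

```python
def relocate_call_sites(call_sites, preferred_base, actual_base):
    """Shift every call site address by the load-address difference."""
    delta = actual_base - preferred_base
    return [addr + delta for addr in call_sites]


# Made-up addresses: a library linked for 0x40000000 but loaded at 0x40200000.
table = [0x40001234, 0x40005678]
fixed = relocate_call_sites(table, 0x40000000, 0x40200000)
```

When the library is loaded at its preferred address the difference is zero and the table is left unchanged.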



3.10 Support for Threads 

To support a multi-threaded application, Paid needs to maintain an execution
pointer for each thread, and applies a mechanism similar to the way
setjmp/longjmp is handled to switch the execution pointer at run time. At this point
Paid does not support applications using user-level threads. However, this is not
a major limitation because most multi-threaded applications use kernel-level threads,
which Linux supports through the clone system call. Although many
applications use pthread, the pthread library of GLIBC actually uses kernel-level
threads as well on Linux. We modified the clone and fork system calls so that
the SCSFG region is copied when they are called.







Fig. 5. This figure shows the two cases that Paid cannot handle. Figure (A) shows an 
attack that only needs a single system call and the desired system call is immediately 
after a buffer overflow. Figure (B) shows the argument replacement attack assuming 
that the arguments needed by exec are directly passed from the function m to the 
system call exec through the call chain. After the buffer overflow, the injected code 
sets up the desired arguments for the foo function, and then calls foo directly 



4 Attack Analysis 

Even though Paid uses several techniques to reduce the feasibility of mimicry
attacks, there are two cases in which mimicry attacks are still possible, as shown
in Figure 5. First, if attack code can compromise an application and then
directly issue a damaging system call without needing to get control back,
for example, when an exec system call immediately follows a buffer overflow
vulnerability, Paid cannot stop this type of attack. Second, if an attacker can
mount an attack without making system calls explicitly, for example, by just
manipulating the system call arguments of one or multiple legitimate system calls
already in the program, Paid cannot stop this type of attack, either. However,
to the best of our knowledge these two types of attacks are very rare in practice.

The stack integrity check helps Paid to ensure that the stack state at the time
a system call is made is proper. However, there is no guarantee that the
monitored application indeed goes through the function call sequence as reflected in
the chain of return addresses, because the attacker can easily doctor the stack
frames before making any system call just to fool the stack integrity checking
mechanism. Nonetheless, the stack integrity check at least greatly complicates the
attack code, while incurring a relatively modest performance overhead, as shown
in Section 5.

Because Paid embeds an application’s SCSFG inside its binary file, and 
checks the application’s run-time behavior against it, some ill-formed SCSFGs 
may cause Paid's run-time verifier to misbehave, such as entering an infinite 
loop, thus leading to denial-of-service attacks. To the best of our knowledge, 
Paid's run-time verifier does not have such weaknesses. As a second line of de- 
fense, Paid can perform run-time checks only for trusted applications, in a way 
similar to how only certain applications have their setuid bit turned on. Finally, 
the run-time verifier can include some self-monitoring capability, so that it can 
detect anomalous system call graphs and terminate the process. 



5 Performance Evaluation 

5.1 Prototype and Methodology 

The current Paid compiler is derived from GCC 3.1 and GNU ld 2.11.94 (linker),
and runs on Red Hat Linux 7.2. The Paid prototype can successfully compile
the whole GLIBC (version 2.2.5), and production-mode network server programs
implemented using the fork or clone system call, such as Apache and Wu-ftpd.
For this study, we used as test programs the set of network server applications
shown in Table 1, and compared Paid's performance and space requirement
with those of GCC 3.1 and Red Hat Linux 7.2, which represent the baseline
case. To analyze detailed performance overhead, we conducted each experiment
in four different configurations: plain Paid that only uses SCSFGs (plain
Paid), Paid with stack integrity check (Paid/stack), Paid with random insertion of
null system calls (Paid/random), and Paid with both stack integrity check and
random null system calls (Paid/stack/random).

To test the performance of each server program, we used two client machines 
to continuously send 2000 requests to the tested server program. In addition, 
we modified the server machine’s kernel to record the creation and termination 
time of each process. The throughput of a network server application is calculated 
by dividing 2000 by the time interval between the creation of the first forked 
process and the termination of the last forked process. The latency is calculated 
by taking the average of the run time used by the 2000 forked processes. The 
Apache web server program is handled separately in this study. We configured 
Apache to handle each incoming request with a single child process so that we 
could accurately measure the latency of each Web request. 
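The throughput and latency definitions above can be written down as a short calculation. This is a sketch under assumptions, not the authors' measurement scripts: the function names are hypothetical, each process is represented as a (creation time, termination time) pair in seconds, and "last forked process" is taken to mean the process that terminates last.

```python
def throughput(records, num_requests):
    """Requests per second over the whole run."""
    first_created = min(created for created, _ in records)
    last_terminated = max(terminated for _, terminated in records)
    return num_requests / (last_terminated - first_created)


def mean_latency(records):
    """Average run time of the forked processes."""
    return sum(terminated - created for created, terminated in records) / len(records)


def penalty_percent(baseline, measured):
    """Relative slowdown of `measured` against `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


# Three made-up process records (seconds), standing in for the 2000 in the study.
records = [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0)]
rate = throughput(records, len(records))     # 3 requests over 4 seconds
latency = mean_latency(records)              # each process ran for 2 seconds
```

For latency the penalty would compare the Paid figure against the baseline figure directly; for throughput, where lower is worse, the sign convention would need to be flipped.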

The server machine is a 1.5-GHz P4 with 256MB memory, one client machine
is a 300-MHz P2 with 128MB memory and the other client is a 1.1-GHz P3 with
512MB memory. They are connected through an isolated 100Mbps Ethernet
link. All machines run Red Hat Linux 7.2. To test http and ftp servers, the client
machines continuously fetched a 1-KByte file from the server, and the two client
programs were started simultaneously. In the case of the pop3 server, the clients
checked mail and retrieved a 1-KByte mail from the server. For the sendmail
server, two clients continuously sent a 1-KByte mail to two different users. All
client programs used in the test were modified in such a way that they
continuously sent 1000 requests to the server. A new request was sent only after the
previous one was completely finished. To speed up the request sending process,
client programs simply discarded the data returned from the server. All network
server programs tested were statically linked, and the modified GLIBC-2.2.5
library was recompiled by the Paid compiler.

5.2 Effectiveness 

Two small programs with buffer overflow vulnerabilities were used to test the
effectiveness of Paid in stopping attackers from making unauthorized system
calls. The first program allows attackers to overflow the return address of one
of its functions and point it to a piece of injected code that makes malicious
system calls. Paid’s run-time checker successfully stopped the execution of the
program since the injected system calls were not the next valid system calls in
the original program’s main SCSFG. The second program allows attackers to
overflow a function pointer to point to a piece of dynamically injected code.
Existing buffer overflow defense systems such as Stackguard/Stackshield and
RAD cannot handle this type of attack. However, the notify system call in Paid
successfully detects this attack because the SCSFG of the injected code was not
found in the original program’s .scsfg segment. We also tested Paid on the
“double free” vulnerability of wu-ftpd-2.6.0. The exploit script is based on the
program written by TESO Security [20]. Again Paid is able to successfully stop
this attack, while the exploit successfully spawns a root shell from the ftpd that
is compiled by the original GCC compiler.

5.3 Performance Overhead 

Paid adds an extra .scsfg segment to an application’s binary image to store
all SCSFGs. The size of the .scsfg segment is expected to be large because it



Table 1. Characteristics of a set of popular network applications that are known to
have buffer overflow vulnerabilities. The source code line count includes all the libraries
used in the programs, excluding libc

Program Name       Lines of Code   Brief Description
Qpopper-4.0        32104           Pop3 server
Apache-1.3.20      51974           Web server
Sendmail-8.11.3    73612           Mail server
Wu-ftpd-2.6.0      28055           Ftp server
Proftpd-1.2.8      58620           Ftp server
Pure-ftpd-1.0.14   28182           Ftp server

Table 2. The binary image size and compilation time overhead of Paid compared with
GCC. The absolute size of the .scsfg segment of each network application in bytes is
also listed

                   Plain Paid                               Random Paid
Program Name       Binary Size  SCSFGs Size  Compile Time   Binary Size  SCSFGs Size  Compile Time
                   Overhead     (bytes)      Overhead       Overhead     (bytes)      Overhead
Qpopper-4.0        124.86%      798,780      119.30%        --           --           121.00%
Apache-1.3.20      156.38%      1,338,739    --             --           --           247.92%
Sendmail-8.11.3    153.87%      1,845,208    198.15%        --           --           211.53%
Wu-ftpd-2.6.0      --           --           164.73%        173.04%      1,253,412    197.45%
Proftpd-1.2.8      178.62%      1,590,262    169.84%        194.12%      1,738,270    194.18%
Pure-ftpd-1.0.14   116.27%      680,820      100.12%        128.85%      755,304      120.85%

contains the SCSFGs of all the functions in the application as well as in the
libraries that the application links to. Because of the .scsfg segment, and because
of system call inlining, the binary image of a network application compiled under
the Paid compiler is much larger than that compiled under GCC. The binary
image space overhead of the test applications ranges from 116.27% to 178.02%
for the applications compiled by plain Paid, and from 128.85% to 194.12% for
the applications compiled by Paid with random null system calls, as shown in
Table 2. Most of the space overhead is indeed due to the .scsfg segment. The
absolute size of this segment for each of the test network applications is also
shown in Table 2. Note that this increase in binary size only stresses the user
address space, but has no effect on the kernel address space size.

The Paid compiler also needs more time to extract the SCSFGs. Table 2
shows that the additional compilation time overhead of Paid when compared
with GCC is from 100.12% to 232.98% for plain Paid, and from 120.85% to
247.92% for Paid/random. Under Paid’s compiler, compilation of applications
takes two passes. Table 2 only shows the compilation time overhead of the second
pass.



Table 3. The latency penalty of each network application compiled under Paid with
different configurations when compared with the baseline case

Program            Paid      Paid/stack   Paid/random   Paid/stack/random
Qpopper-4.0        5.69%     5.84%        6.42%         6.63%
Apache-1.3.20      5.14%     5.70%        6.93%         7.63%
Sendmail-8.11.3    7.31%     8.38%        10.32%        11.66%
Wu-ftpd-2.6.0      2.28%     2.76%        3.73%         4.58%
Proftpd-1.2.8      6.85%     7.63%        8.55%         9.85%
Pure-ftpd-1.0.14   4.80%     5.33%        5.10%         7.58%


The performance overhead of Paid mainly comes from the additional check
at each system call invocation. More specifically, it involves the stack integrity check
and the decision logic required to move to the next DFA state. We measured the
average latency penalty at each system call due to this check, and the results in
Table 5 show that this penalty is between 5.43% and 7.48%. However, the overall
latency penalty of plain Paid compared to the base case (GCC + generic Linux)
is smaller, as shown in Table 3 and Table 4, because each network application
also spends a significant portion of its run time in the user space. As a result,
the overall latency penalty of plain Paid ranges from 2.28% (Wu-ftpd) to 7.31%
(Sendmail), and the throughput penalty ranges from 2.23% (Wu-ftpd) to 6.81%
(Sendmail). These results demonstrate that despite the fact that Paid constructs
a detailed per-application behavior model and checks it at run time, its run-time
performance cost is modest. This is also true for the Paid/stack/random
configuration, which includes both stack check and random insertion of null
system calls. The latency penalty for Paid/stack/random is from 4.58%
(Wu-ftpd) to 11.66% (Sendmail), and the throughput penalty ranges from
4.38% (Wu-ftpd) to 10.44% (Sendmail).

Compared with plain Paid, Paid/stack only increases the performance
overhead slightly, as shown in Table 3 and Table 4. The latency penalty of Paid/stack
ranges from 2.76% to 8.38%, and the throughput penalty ranges from 2.69% to
7.73%. This shows that checking the stack integrity at every system call is a
relatively inexpensive verification mechanism. In contrast, Paid/random incurs
more overhead than Paid/stack, because each null system call inserted incurs
expensive context switching overhead. As the number of null system calls inserted
increases, this overhead also increases, but the strength of protection against
mimicry attacks also improves as attack code is forced to follow more closely
the application’s original execution flow. The latency penalty for Paid/random
ranges from 3.73% to 10.32%, and the throughput penalty ranges from 3.60% to
9.36%.



Table 4. The throughput penalty of each network application compiled under Paid
with different configurations when compared with the baseline case

Program            Paid      Paid/stack   Paid/random   Paid/stack/random
Qpopper-4.0        5.38%     5.52%        6.03%         6.22%
Apache-1.3.20      4.89%     5.39%        6.48%         7.09%
Sendmail-8.11.3    6.81%     7.73%        9.36%         10.44%
Wu-ftpd-2.6.0      2.23%     2.69%        3.60%         4.38%
Proftpd-1.2.8      6.41%     7.10%        7.87%         8.96%
Pure-ftpd-1.0.14   4.58%     5.06%        4.85%         7.05%


One of the major concerns early in the development cycle of this project
was the performance overhead associated with notify system calls, which are
used to amend the main SCSFG at run time when functions are called through
function pointers or other dynamic control transfers that cannot be determined at
compile time. Amending an SCSFG may take a non-negligible amount of time. In
addition, the SCSFG region and the SCSFG’s current execution point need to
be copied to the child process as part of fork or clone. Copying of the SCSFG
region actually does not incur serious performance overhead because Linux uses
copy-on-write. However, when notify system calls are made, the SCSFG needs
to be modified, and additional data copying is required to implement
copy-on-write.

To study the impact of notify system calls due to function pointers on
application performance, we instrumented the Paid compiler and the run-time
verifier to measure the notify system call frequency. Table 5 lists the number
of notify calls statically inserted into each tested network application and the
modified GLIBC. We also collected the actual number of notify calls made
at run time. The number of dynamic notify calls includes both calls from the
applications and the modified GLIBC. If the SCSFG of a function has been
linked to the main SCSFG by a notify call at the same call site previously, the
verifier does not need to search for the SCSFG again. The last column of the
table shows the number of notify calls that actually need a full-scale search
through the SCSFGs, or the number of notify cache misses.
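The caching behavior described above can be illustrated with a small sketch. It is not the verifier's actual data structure: the class name NotifyLinker is hypothetical, function graphs are modeled as (entry address, graph) pairs, and the cache is assumed to be keyed by the pair of call site and target function.

```python
class NotifyLinker:
    def __init__(self, function_graphs):
        self.function_graphs = function_graphs   # list of (entry_address, graph)
        self.linked = {}                         # (call_site, target) -> graph
        self.full_searches = 0                   # counts notify cache misses

    def notify(self, call_site, target):
        """Return the graph to link, or None if the target has no known graph."""
        key = (call_site, target)
        if key in self.linked:
            return self.linked[key]              # linked before: no search needed
        self.full_searches += 1
        for entry, graph in self.function_graphs:   # full-scale search
            if entry == target:
                self.linked[key] = graph
                return graph
        return None                              # e.g. injected code: no graph found


linker = NotifyLinker([(0x1000, "graph_of_f"), (0x2000, "graph_of_g")])
first = linker.notify(call_site=0x500, target=0x2000)
second = linker.notify(call_site=0x500, target=0x2000)
unknown = linker.notify(call_site=0x500, target=0xBFFF0000)
```

After these three calls the counter stands at 2: one miss for the first lookup of the known function and one for the unknown target, while the repeated call is served from the cache.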

The number of statically inserted notify calls ranges from 2 (wu-ftpd) to
247 (Apache), and the number of actual calls ranges from 92 (pure-ftpd) to 673
(Proftpd). The number of notify calls that need full SCSFG search ranges from
7 (wu-ftpd) to 33 (Proftpd). Even though Proftpd makes 673 notify calls at
run time, its latency penalty is 9.85% and the throughput penalty is 8.96%.
This low overhead mainly comes from the surprisingly low overhead associated
with notify system calls. When SCSFG search is not needed, a notify system
call only needs 1,745 CPU cycles. The average time requirement for a notify
system call that needs SCSFG search is 3,383 CPU cycles. However, most
notify system calls do not need SCSFG search.
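As a rough worked example of these figures, the total notify cost for Proftpd can be estimated from the counts in Table 5 and the two per-call cycle costs. This is only back-of-the-envelope arithmetic: it treats the 3,383-cycle average as a fixed cost per search, and it assumes the 1.5-GHz clock of the server machine described in Section 5.1.

```python
HIT_CYCLES = 1745      # notify call that needs no SCSFG search
MISS_CYCLES = 3383     # average for a notify call that needs a full search


def notify_cycles(dynamic_calls, full_searches):
    """Estimated total CPU cycles spent in notify system calls."""
    hits = dynamic_calls - full_searches
    return hits * HIT_CYCLES + full_searches * MISS_CYCLES


proftpd_cycles = notify_cycles(673, 33)        # 640 * 1745 + 33 * 3383
proftpd_seconds = proftpd_cycles / 1.5e9       # assumed 1.5-GHz clock
```

This gives 1,228,439 cycles, or a little under one millisecond in total on the assumed clock, which is consistent with the observation that notify calls contribute little to the overall penalty.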



Table 5. The average per-system call latency penalty, the number of static and
dynamic notify system calls, and the number of dynamic notify system calls that need
full-scale SCSFG search

Program Name       Average System   Static Notify   Dynamic Notify   Notify Calls Needing
                   Call Overhead    Count           Count            SCSFG Search
Qpopper-4.0        7.48%            111             49               7
Apache-1.3.20      5.61%            247             181              10
Sendmail-8.11.3    7.06%            69              386              16
Wu-ftpd-2.6.0      6.09%            2               99               7
Proftpd-1.2.8      6.89%            149             673              33
Pure-ftpd-1.0.14   5.43%            4               92               23
Modified GLIBC     N/A              604             N/A              N/A


6 Conclusion 

System call-based intrusion detection provides the last line of defense against
control-hijacking attacks because it limits what attackers can do even after they
successfully compromise a victim application. The Achilles’ heel of system
call-based intrusion detection is how to efficiently and accurately derive a system
call model that can be tailored to individual network applications. This paper
describes the design, implementation, and evaluation of Paid, a fully operational
compiler-based system call-based intrusion detection system that can
automatically derive a highly accurate system call model from the source code of an
arbitrary network application. One key feature of Paid is its ability to exploit
run-time information to minimize the degree of non-determinism that is
inherent in a pure static analysis approach to extracting the system call graph. The other
unique feature that sets Paid apart from all existing system call-based intrusion
detection systems, including commercial behavior blocking products, is its
application of several techniques that together reduce the vulnerability window to
mimicry attacks to a very small set of unlikely program patterns. As a result,
we believe Paid represents one of the most comprehensive, robust, and precise
host-based intrusion detection systems that are truly usable, scalable, and
extensible. Performance measurements on a fully working Paid prototype show
that the run-time latency and throughput penalties of Paid are at most 11.66%
and 10.44%, respectively, for a set of popular network applications including the
Apache web server, the Sendmail SMTP server, a Pop3 server, the wu-ftpd FTP
daemon, etc. Furthermore, by using a system call inlining technique, Paid
dramatically reduces the run-time system call overhead. This excellent performance
improvement mainly comes from the fact that the SCSFG that Paid's compiler
generates is a DFA.

Currently, we are working on extending the Paid prototype in the following 
directions. First, we are developing compiler techniques that can capture system 
call arguments whose values can be statically determined or remain fixed after 
initialization. Being able to check system call arguments further shrinks the 
window of vulnerability to control-hijacking attacks. Second, we are exploring 
the feasibility of applying the same security policy extraction methodology on 
binary programs directly, so that even legacy applications whose source code is 
not available can enjoy the protection that Paid can provide. 

Abstract. Many intrusions amplify rights or circumvent defenses by issuing
system calls in ways that the original process did not. Defense against these attacks
emphasizes preventing attacking code from being introduced to the system and
detecting or preventing execution of the injected code. Another approach, where
this paper fits in, is to assume that both injection and execution have occurred,
and to detect and prevent the executing code from subverting the target system.
We propose a method using waypoints: marks along the normal execution path
that a process must follow to successfully access operating system services.
Waypoints actively log trustworthy context information as the program executes,
allowing our anomaly monitor to both monitor control flow and restrict system call
permissions to conform to the legitimate needs of application functions. We
describe our design and implementation of waypoints and present results showing
that waypoint-based anomaly monitors can detect a subset of mimicry attacks and
impossible paths.

Keywords: anomaly detection, context sensitive, waypoint, control flow moni- 
toring, mimicry attacks, impossible paths 



1 Introduction 

Common remote attacks on computer systems have exploited implementation errors to 
inject code into running processes. Buffer overflow attacks are the best-known example
of this type of attack. For years, people have been working on preventing, detecting,
and tolerating these attacks [1-13]. Despite these efforts, current systems are not secure. 
Attackers frequently find new vulnerabilities and quickly develop adaptive methods that 
circumvent security mechanisms. 

Host-based defense can take place at one of three stages: preventing code injection, 
preventing execution of the injected code, and detecting the attack after the injected 
code has begun execution. One class of detection mechanisms, execution-monitoring 
anomaly detection, compares a stream of observable events in the execution of a running 
process to a profile of “known-good” behavior, and raises alerts on deviations from the
profile. While it is possible to treat each instruction executed by the process as an event
for comparison to the profile, typical anomaly detectors use system calls [6, 14-17] or
function calls [5, 18] as the granularity for events. 

* This work was supported in part by a Syracuse University Graduate Fellowship Award.

We focus our efforts on detecting attempts to subvert the system through the kernel
API (the system call interface), assuming the attacking code has started to run. We 
monitor requests for system services (i.e., system calls) of running processes, and detect 
anomalous requests that could not occur as a result of executing the code in the original 
binary program image. 

Two major problems that system-call based anomaly detection faces are mimicry
attacks [12, 19] and impossible paths [12]. A mimicry attack interleaves the real
attacking code with innocuous code, thereby impersonating a legitimate sequence of actions.
For example, if the legitimate code has the system call sequence

getuid() ... open() ... execve()

and the attack has the sequence

getuid() ... execve()

the attacker can add a “no-op” system call to match the legitimate sequence:

getuid() ... open("/dev/null"...) ... execve().
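The example above can be made concrete with a toy detector. This is an illustration only and not a model of any specific system: the detector is assumed to compare system call names against a single legitimate sequence while ignoring arguments, and all names in the code are hypothetical.

```python
LEGITIMATE = ["getuid", "open", "execve"]


def accepts(trace):
    """Accept a trace if its system call names match the legitimate sequence.

    Each trace entry is a (name, arguments) pair; arguments are ignored,
    which is the weakness a mimicry attack exploits.
    """
    return [name for name, _args in trace] == LEGITIMATE


plain_attack = [("getuid", ()), ("execve", ("/bin/sh",))]
mimicry_attack = [
    ("getuid", ()),
    ("open", ("/dev/null",)),        # harmless "no-op" call added as padding
    ("execve", ("/bin/sh",)),
]

plain_detected = not accepts(plain_attack)
mimicry_detected = not accepts(mimicry_attack)
```

The plain attack is flagged because a call is missing, while the padded attack passes because the name sequence is indistinguishable from the legitimate one.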

We further divide mimicry attacks into global mimicry attacks and local mimicry
attacks. Considering the minimum set of system calls necessary for the functionality
of an application function, the system call sequence in a global mimicry attack com- 
bines the legal system calls of multiple functions, while a local mimicry attack uses the 
legal system calls of only the running function. 

An impossible path is a control path (with a sequence of system calls) that will never 
be executed by the legal program, but is a legal path on the control flow graph of the 
program. Impossible paths can be generated due to the nature of the non-deterministic 
finite state automata (NDFSA). For example, when both location A and B can call 
function f(), function f() can return to either location A or B. The call graph for the 
program allows a return-into-others impossible path wherein location A calls function 
f(), but the return goes to location B, a behavior that appears legal in the control flow 
graph. This example attack is similar to a return-into-lib(c) attack in that both of them 
modify the legal control path at the function return points. 
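
The return-into-others path can be reproduced with a toy model. The sketch below is an illustration under assumed node names (`A_call`, `B_call`, `f_entry`, `f_exit`, `A_ret`, `B_ret`): a flat automaton accepts the impossible path, while a check that remembers the call site on a stack rejects it.

```python
# Call sites A and B both call f(); in the flattened automaton, the exit of f()
# has an edge back to the instruction after each call site.
EDGES = {
    "A_call": {"f_entry"},
    "B_call": {"f_entry"},
    "f_entry": {"f_exit"},
    "f_exit": {"A_ret", "B_ret"},
}


def accepted_by_ndfsa(path):
    """Context-insensitive check: every step must follow some edge."""
    return all(dst in EDGES.get(src, set()) for src, dst in zip(path, path[1:]))


def accepted_with_stack(path):
    """Context-sensitive check: a return must match the most recent call site."""
    pending_returns = []
    for src, dst in zip(path, path[1:]):
        if dst not in EDGES.get(src, set()):
            return False
        if src.endswith("_call"):
            pending_returns.append(src.replace("_call", "_ret"))
        elif src == "f_exit":
            if not pending_returns or pending_returns.pop() != dst:
                return False
    return True


legal = ["A_call", "f_entry", "f_exit", "A_ret"]
impossible = ["A_call", "f_entry", "f_exit", "B_ret"]

print(accepted_by_ndfsa(impossible))     # True
print(accepted_with_stack(impossible))   # False
print(accepted_with_stack(legal))        # True
```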

This paper introduces our use of waypoints¹ for assisting anomaly monitoring. 
Waypoints are kernel-supported trustworthy markers on the execution path that the 
process must follow when making system calls. In this paper, we use function-level 
scoping as the context of a waypoint. If function C calls function D, then the active con- 
text is that of D; upon function return, the active context is again that of C. Waypoints 
provide control flow context for security checking, which supports call flow checking 
approaches such as that in Feng et al. [5] and allows us to check whether the process 
being monitored has permission to make the requested system call in the context of the 
current waypoint. 

¹ This terminology is borrowed from route planning with a GPS system, although the term is as 
old as navigation systems; it means a specific location saved in the receiver’s memory that is 
used along a planned route. 

The work presented in this paper makes the following contributions: 

1. Kernel-supported waypoints provide fine-grained trustworthy context information 
for on-line monitoring. Using this information, we can restrict a process to access 
only those system calls that appeared in the original program fragment associated 
with the waypoint context. 

Waypoints can change the granularity of intrusion detection systems that monitor 
system call sequences. The more waypoints we set between two system calls, the 
more precise control of that program path we can provide to the detector. 

2. Using the context information, our anomaly monitor can detect global mimicry 
attacks that use permissions (i.e. allowed system calls) across multiple functions. 
Any system service request falling out of the permission set of the current context 
is abnormal. 

3. Our anomaly monitor can detect return-into-others impossible path attacks. We 
use waypoints to monitor the function call flow and to guarantee that callees return 
to the right locations. 

In the next section, we describe our model of attacks in detail. Section 3 describes 
our design and implementation of waypoints and the waypoint-based system call mon- 
itor. In section 4 we present performance measurements of our approach. Section 5 
summarizes related work. Section 6 discusses the limitations and our future work and 
gives our conclusions. 



2 Attack Models 

Once the exploit code has a chance to run, it can access the system interface in the 
following three ways, which we present in order of increasing code granularity: 

1. Jumping to a system call instruction, or a series of such instructions, within the 
injected code itself. Many remote attacks use shellcode - a piece of binary code 
that executes a command shell for attackers [20]. Most shellcode issues system 
requests directly through pre-compiled binary code. In this case, the attacker relies 
on knowing the system call numbers and parameters at the time he compiles the 
attack code, which, in the presence of a near monoculture in system architecture 
and standardized operating systems, is a reasonable assumption. The control path 
in this case is fully under the control of the attacker, as he controls the location of 
the sensitive system call at the time of the code injection. 

2. Transferring control to legitimate code elsewhere in the process; the target code can 
be at any link on the path to the system call instruction. The attacking code achieves 
its goal by executing from the target instruction forward to the desired system call. 
The attack can be achieved either by creating fake parameters and then jumping 
to a legitimate system call instruction (e.g., making it appear that the argument to 
an existing execve() call was "/bin/sh"), or by jumping to a point on the 
path leading to an actual call (e.g., execve("/bin/sh")) in the original 
program. Locations used in the latter attack include a system call wrapper function in 
a system library such as libc, an entry in the procedure linkage table (PLT) using 
the address in the corresponding global offset table (GOT), or an instruction in an 
application function that leads to calling the above entries. This is a general form of 
the return-into-lib(c) attack [21], in which the corrupted return address 
on the stack forces a control transfer to a system call wrapper in the libc library. 
For the remainder of this paper we will refer to this type of attack as a low-level 
control transfer (LCT) attack. In contrast to defending against the shellcode attack, 
it is of paramount importance to protect the control path when defending against an 
LCT attack. 

3. Calling an existing application function that performs the system call(s) that the at- 
tacking code requires. While this is a form of control transfer attack, we distinguish 
it from the LCT because the granularity of the attack is at the application function 
level, not at the level of the individual instruction or system call. In this case, the 
control path is the sequence of application-level function invocations leading to the 
function that contains the attacking call. 

Mimicry attacks can be achieved by jumping directly to injected code that mimics a 
legal sequence of system calls, or by calling a sequence of lib(c) functions; these fall 
into categories 1 and 2 above, respectively. Attackers can also use category 3 attacks 
(i.e. calling existing application functions), but these are easier to detect than 
category 1 and 2 attacks by using call flow monitoring techniques. Attackers can also exploit 
impossible paths to elude detection by using any of the three categories of attack techniques above. 

While function call flow monitoring can reduce attacks in category 3, and non- 
executable data sections can block attacks in category 1, attacks using category 2 tech- 
niques are more difficult to detect because they use legitimate code to achieve malicious 
purposes. An important characteristic that attackers use is that the default protection 
model permits programs to invoke any system call from any function, but in actuality 
each system call is only invoked from a few locations in the legal code. While some 
previous work has exploited the idea of binding system calls or other security sensitive 
events with context [5, 18, 22-24], this paper explores this approach further. We 
introduce the concept of waypoints to provide trustworthy control flow information, and 
show how to apply the information in anomaly detection. 

3 Waypoint-Based System Call Access Control 

We observe that an application function - a function in an application program, not a 
library function - in general uses only a small subset of system service routines², but 
has the power to invoke any system call under the default Unix protection model. This 
practice violates the principle of least privilege, which would restrict a function to invoking 
only the system calls that are necessary for its proper execution. For example, execve() is 
not used by many legitimate functions, especially in setuid root regions, but it is 
common for exploit code to invoke that system call within the scope (or equivalently, the 
stack frame) of any vulnerable function. Waypoints provide a mechanism for restricting 
program access to system calls and enforcing least privilege. 

² For example, in Tables 1 and 2 of section 3.3, only 3 out of 416 application functions being 
monitored require execve() legally. 

3.1 Waypoint Design 

A waypoint, located in the code section, is a trustworthy checkpoint on control flow. 
Waypoints can actively report control flow information in real-time to assist intrusion 
detection, rather than gathering the information only at system call time. People can 
assign security attributes to each waypoint or to a sequence of waypoints. 

To achieve our goals, waypoints must embody the following properties: 

1. Authentication 

Because we assume that an attack has successfully started executing, and the attack 
has the right to access the whole process image, it is possible that the attacking 
code can overwrite code pointers. Although the code section is usually read-only, 
dynamically-generated code will be located in memory with both read and write 
permissions. This means that attackers have the ability to generate waypoints within 
their own code, and we must therefore authenticate waypoints. 

We authenticate the waypoints by their locations. Waypoints are deployed before 
the process runs, such that the waypoint locations are registered at program loading 
time. In this way, we can catch false waypoints generated at run time. 

2. Integrity 

Because attackers can access the whole process image, information generated at 
and describing waypoints (e.g., their privileges) should be kept away from mali- 
cious access. We store all waypoint-related data and code in the kernel. 

3. Compatibility 

Our waypoints work directly on binary code, so the original code may be generated 
from different high-level languages or with different compilers. 

A natural granularity for control flow monitoring is at the function level. To trace 
function call flow, we set up waypoints at function entrance and exit. We generate way- 
points and their associated permissions on a per-function basis through static analysis. 

At run time, we can construct a push-down automaton of the waypoints that parallels 
the execution stack of the process. An entrance waypoint disables the permissions of 
the previous waypoint, pushes the entrance waypoint on top of the waypoint stack, and 
enables its permissions. A corresponding exit waypoint discards the top value on the 
waypoint stack and restores the permissions of the previous waypoint. 

It is possible that we assign different permissions to different parts of a function. In 
this case, we need a middle waypoint. A middle waypoint does not change the waypoint 
stack; it only changes the permissions of the waypoint on top of the stack. 
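
The push-down behavior of entrance, middle, and exit waypoints can be sketched as a small stack of permission sets. This is a user-space illustration only; the class name `WaypointStack` and its methods are hypothetical and do not reflect the kernel data structures described in section 3.2.

```python
class WaypointStack:
    """Toy model of the waypoint stack and the active permission set."""

    def __init__(self, initial_permissions):
        # Each entry is [function name, permission set currently enabled for it].
        self._stack = [["<initial>", frozenset(initial_permissions)]]

    @property
    def active_permissions(self):
        """Permissions of the most recently activated waypoint."""
        return self._stack[-1][1]

    def entrance(self, function, permissions):
        """Push the callee; the caller's permissions stay saved underneath."""
        self._stack.append([function, frozenset(permissions)])

    def middle(self, permissions):
        """Change permissions without changing the stack depth."""
        self._stack[-1][1] = frozenset(permissions)

    def exit(self):
        """Discard the top entry, which restores the caller's permissions."""
        if len(self._stack) == 1:
            raise RuntimeError("exit waypoint without a matching entrance")
        self._stack.pop()


ws = WaypointStack(initial_permissions=set())
ws.entrance("C", {"read", "write"})
ws.entrance("D", {"open"})            # C calls D: the active context is D
print(sorted(ws.active_permissions))  # ['open']
ws.exit()                             # D returns: the active context is C again
print(sorted(ws.active_permissions))  # ['read', 'write']
```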

We deploy waypoints only in the application code. Although we do not set way- 
points in libraries, we are concerned about library function exploitation. We treat se- 
curity relevant events (system requests) triggered by library functions as running in the 
context of the application function that initiated the library call(s). 

The waypoint stack records a function call trace. Using this context information, 
our access monitor can detect attacks in two ways: (1) flow monitoring - Globally, 
waypoints comprise the function call trace for the process. We can construct legal 
waypoint paths for some security critical system requests (e.g. execve()), such that when 
such a system call is made, the program must have passed a legal path. Similar ideas 
on control flow monitoring have been proposed in [5, 25]; therefore, we do not discuss 
this approach further in this paper. (2) permission monitoring - Locally, we use static 
analysis to determine the set of system calls (permissions) required for each function, 
and ensure that every system call invoked in the context of a function appears in its 
permission set. 

3.2 Waypoint Implementation 

If a piece of code needs to perform a system request legally, then we say that the piece 
of code has a permission to issue the system request. To simplify the implementation, 
we use a set to describe permissions for a waypoint and store the permission sets in a 
bitmap table. 
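
As a rough illustration of storing a permission set as a bitmap, the sketch below sets bit n for each permitted system call number n. The helper names are hypothetical; the numbers shown are the i386 Linux values for a handful of calls, and the layout of the real in-kernel table is not specified here.

```python
# i386 Linux system call numbers for a few calls (a small subset, for illustration).
SYSCALL_NUMBERS = {"exit": 1, "read": 3, "write": 4, "open": 5, "close": 6, "execve": 11}


def to_bitmap(syscalls):
    """Encode a set of system call names as an integer with one bit per number."""
    bitmap = 0
    for name in syscalls:
        bitmap |= 1 << SYSCALL_NUMBERS[name]
    return bitmap


def is_permitted(bitmap, syscall_number):
    """Test whether the bit for this system call number is set."""
    return (bitmap >> syscall_number) & 1 == 1


config_coding = to_bitmap(["read", "write", "close"])
print(bin(config_coding))                                        # 0b1011000
print(is_permitted(config_coding, SYSCALL_NUMBERS["read"]))      # True
print(is_permitted(config_coding, SYSCALL_NUMBERS["execve"]))    # False
```

A membership test is then a shift and a mask, which keeps the per-system-call check cheap.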

We generate waypoints and their corresponding permissions through static analysis. 
We introduce global control flow information by defining the number of times that a 
function can be invoked. Usually, an application function does not issue system requests 
directly. It calls system call wrappers in the C library instead. The application may 
call the wrapper functions indirectly by calling other library functions first. We build 
a (transitive) map between system call wrappers and system call numbers. Currently, 
we analyze the hierarchical functions manually. Our next step is to automate this whole 
procedure. 
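
One way to picture the transitive map is as reachability over a call graph that ends at system call wrappers. The following sketch assumes a hypothetical application function `load_settings` and a hand-written call graph; it is not the authors' analysis tool, which the text describes as partly manual.

```python
# Hypothetical call graph: application function -> library functions -> wrappers.
CALLS = {
    "load_settings": ["fopen", "fgets", "fclose"],
    "fopen": ["open"],
    "fgets": ["read"],
    "fclose": ["close"],
}
# Names treated as system call wrappers (leaves of the analysis).
WRAPPERS = {"open", "read", "close", "execve"}


def permissions_for(function, calls, wrappers):
    """Return every system call wrapper reachable from `function`."""
    seen, pending, permissions = set(), [function], set()
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in wrappers:
            permissions.add(current)
        pending.extend(calls.get(current, []))
    return permissions


print(sorted(permissions_for("load_settings", CALLS, WRAPPERS)))
# ['close', 'open', 'read']
```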

We deploy the access monitor, together with the waypoint stack and the permission 
bitmap table, in the operating system kernel, as shown in Figure 1. An entry of the 
waypoint stack has two fields: one is the location of the waypoint, and the other is extra 
information for access monitoring. Since we monitor application function call flow, we 
use the second field to store the return address from the function. Each application function 
has one entrance waypoint and one exit waypoint, and the pair is stored in the 
bitmap table. The field “entries” in the bitmap table indicates how many times a waypoint 
can be passed. In our current implementation, we only distinguish between one entry 
and multiple entries, to prevent malicious jumps to the prologue code and function main(), 
which usually contain some dangerous system calls and should be entered only once. 

At a waypoint location, there should be some mechanism to trigger the waypoint 
code in the kernel. We can invoke the waypoint code at several locations: an exception 
handler, an unused system call number service routine, or a new soft interrupt handler. 
We insert an illegal opcode at the waypoint location and run our waypoint management 
code as an exception handler. 

An attacker can overwrite the return address or other code pointer to redirect control 
to a piece of shellcode or a library function. We protect the return address by saving it 
on the waypoint stack when we pass the entrance waypoint. When a waypoint return is 
executed at the exit waypoint, the return address on the regular stack is compared with 
the saved value on the waypoint stack for return address corruption. The exit waypoint 
identifier must also match with the entrance waypoint identifier, since they come in 
pairs. If the attacking code uses an unpaired exit waypoint or a faked waypoint, the 
comparison will fail. If the attack forces return into a different address, although the 
control flow can be changed, the active permission set - the permission set belonging 
to the most recently activated waypoint - is not changed, because the expected exit 
waypoint has not passed. The attacking code will still be limited to the unchanged 
permissions. 
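
The pairing and return-address checks performed at an exit waypoint can be sketched as follows. The waypoint addresses are those of ConfigCoding() in Figure 1; the return address values, the `PAIRS` table, and the class name are assumptions for illustration.

```python
# Exit waypoint address -> entrance waypoint address it is paired with.
PAIRS = {0x804C197: 0x804C0FC}


class GuardedStack:
    """Toy waypoint stack that also stores the return address seen at entry."""

    def __init__(self):
        self._entries = []   # (entrance waypoint address, saved return address)

    def on_entrance(self, entrance_wp, return_address):
        self._entries.append((entrance_wp, return_address))

    def on_exit(self, exit_wp, return_address_on_stack):
        """Return 'ok' and pop, or a violation string and leave the stack as is."""
        if not self._entries:
            return "exit waypoint with empty waypoint stack"
        entrance_wp, saved_return = self._entries[-1]
        if PAIRS.get(exit_wp) != entrance_wp:
            return "unpaired or fake exit waypoint"
        if saved_return != return_address_on_stack:
            return "return address corrupted"
        self._entries.pop()
        return "ok"


stack = GuardedStack()
stack.on_entrance(0x804C0FC, return_address=0x8049ABC)   # hypothetical caller address
print(stack.on_exit(0x804C197, 0xBFFFF000))              # return address corrupted
print(stack.on_exit(0x804C197, 0x8049ABC))               # ok
```

Note that a failed check leaves the entry on the stack, so the active permission set stays that of the function whose exit waypoint has not legitimately passed.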
    0804c0fc <ConfigCoding>:
     804c0fc:   fe      (bad)               <-- entrance waypoint
     804c0fd:   90      nop
     804c0fe:   90      nop
     804c0ff:   90      nop
     804c100:   55      push   %ebp
     804c101:   89 e5   mov    %esp,%ebp
     ...
     804c193:   31 c0   xor    %eax,%eax
     804c195:   5f      pop    %edi
     804c196:   c9      leave
     804c197:   fe      (bad)               <-- exit waypoint
     804c198:   90      nop
     804c199:   90      nop
     804c19a:   90      nop
     804c19b:   c3      ret

    Kernel space: process table, with the task_struct of the process being monitored 

    Table of permissions bitmap for application functions 

    Function         Entrance wp   Exit wp      Number of entries   Permissions
    prologue         0x80490a8                  1                   execve, mmap, break ...
    main()           0x8049720     0x804978e    1                   open, close, exit ...
    ConfigCoding()   0x804c0fc     0x804c197    n                   read, write, close, ...

Fig. 1. Data structures needed for the waypoint-based access monitor: a waypoint stack and a 
table of permission bitmaps. The third column of the bitmap table indicates how many times a 
waypoint may be activated. The prologue code and function main() are allowed to run only one 
time during the process life. Function ConfigCoding() can be called unlimited times. 



3.3 Monitoring Granularity 

In our implementation, each waypoint causes a kernel trap, and each guarded function 
has at least two waypoints (an entrance/exit pair, plus optional middle waypoints). Thus, 
the performance of the system is dependent on the granularity of waypoint insertion. 
Our first implementation monitored every function, irrespective of whatever system 
calls the function contained. As reported in section 4, the overhead can be substantial. 
Not all system calls are equally useful for subverting a system. We define dangerous 
system calls as those rated at threat level 1 in [26]. There are 22 dangerous system calls 
in Linux: chmod, fchmod, chown, fchown, lchown, execve, mount, rename, open, 
link, symlink, unlink, setuid, setresuid, setfsuid, setreuid, setgroups, 
setgid, setfsgid, setresgid, setregid, and create_module. Such system calls 
can be used to take full control of the system. 

Tables 1 and 2 show the number of functions containing dangerous system calls, and 
the distribution of permissions. The tables show that only a small portion of the 
application functions invoke dangerous system calls. Most functions call at most one dangerous 
system call, and no function calls more than three. Only three functions (two in tar 
and one in kon2) in the whole table require execve. Most functions invoking three 
dangerous system calls contain only file system-related calls such as open, symlink, and 
unlink. 



Table 1. Number of functions invoking dangerous system calls, and the distribution of calls. Only 
12% (50) of functions in our analysis use dangerous system calls, while 1.2% (5) of them contain 
3 dangerous calls. 

    program    # of application   # of functions         containing    containing    containing
               functions in       containing dangerous   3 dangerous   2 dangerous   1 dangerous
               total              system calls           system calls  system calls  system call
    enscript   48                 8                      0             0             8
    tar        165                26                     3             3             20
    gzip       92                 6                      1             3             2
    kon2       111                10                     1             5             4
    total      416                12%                    1.2%          2.6%          8%

Table 2. Number of functions invoking dangerous system calls. For example, 20 functions in 
tar invoke open or rename. Only 0.7% (3) of all functions in our analysis call execve. 

    program    # of functions   [?]   execve   [?]    open or rename   [?]    [?]
    enscript   48               0     0        0      8                0      0
    tar        165              0     2        1      20               3      4
    gzip       92               0     0        0      3                2      5
    kon2       111              0     1        1      6                4      2
    total      416              0     0.7%     0.5%   9%               2.2%   2.6%



The distribution of dangerous system calls shows that partitioning of system call 
permissions should be effective. If an exploit happens within the context of one 
function, the attacker can use only those system calls authorized for that function, which 
significantly restricts the power of the attack in gaining control of the system. Existing 
code-injection attacks exploit flaws in input routines, which do not, in general, have 
permission to call dangerous system calls. The open call, however, is used widely in 
application functions (9% of application functions use it), requiring further restrictions on its 
parameters. 

The numbers show that, as an alternative to monitoring every function, we can monitor 
only functions containing dangerous system calls to detect subversions. In this case, 
we have a default permission set that allows all other system calls, and only deploy 
waypoints when switching between the general, default permissions and the strict, spe- 
cific permissions associated with the function that uses dangerous system calls. This 
is a conscious trade-off of capability for performance; we no longer have a full way- 
point stack in the kernel that reflects all function calls during program execution, but 
the overhead decreases significantly, as shown in figure 3 of section 4. 
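
Under this policy, deciding where to place waypoints reduces to a filter over the per-function system call sets. The sketch below uses the 22 dangerous calls listed above; the function names in the example and the helper `plan_waypoints` are hypothetical.

```python
# The 22 dangerous system calls listed in this section.
DANGEROUS = frozenset({
    "chmod", "fchmod", "chown", "fchown", "lchown", "execve", "mount",
    "rename", "open", "link", "symlink", "unlink", "setuid", "setresuid",
    "setfsuid", "setreuid", "setgroups", "setgid", "setfsgid", "setresgid",
    "setregid", "create_module",
})


def plan_waypoints(function_syscalls, all_syscalls):
    """Split functions into guarded ones and those covered by default permissions."""
    default_permissions = frozenset(all_syscalls) - DANGEROUS
    guarded = {
        name: frozenset(calls)
        for name, calls in function_syscalls.items()
        if DANGEROUS.intersection(calls)
    }
    return default_permissions, guarded


# Hypothetical application functions and the system calls they need.
functions = {
    "read_input": {"read"},
    "save_file": {"open", "write", "close"},
    "run_helper": {"execve"},
}
default, guarded = plan_waypoints(
    functions, {"read", "write", "close", "open", "execve"}
)
print(sorted(default))   # ['close', 'read', 'write']
print(sorted(guarded))   # ['run_helper', 'save_file']
```

Only the functions in `guarded` receive waypoints; every other function runs under the default set, which contains no dangerous call.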



3.4 Waypoint-Based Anomaly Monitoring 

After generating waypoints and their permission sets for a program, we monitor the 
program at run time. The procedure of the waypoint-based security monitor can be 
described in the following steps: 

1. Marking a process with waypoints 

At the process initialization stage, we mark a process by setting a flag in its cor- 
responding task_struct in the process table, indicating whether the process is 
being monitored or not. For a process being monitored, we set up a waypoint stack 
and create a table of permission bitmaps for the waypoints. The permission sets are 
generated statically. 

2. Managing the waypoints at run time 

Waypoints are authenticated by their linear addresses. We implement the manage- 
ment procedure in an exception handler. When an exception is triggered, we first 
check whether it is a legitimate waypoint or not. A legitimate waypoint satisfies 
three conditions: (1) the process is being monitored; (2) the location of the excep- 
tion (waypoint location) can be found in the legal waypoint list; and (3) the number 
of times that the waypoint is activated is less than or equal to the maximum allowed 
times. If the conditions are not satisfied, we pass control to the regular exception 
handler. 

After the verification, we manage the waypoint stack according to the type of the 
waypoint. If it is an entrance waypoint, we push it onto the waypoint stack and ac- 
tivate its permission set; if it is a middle waypoint, we only update the permissions; 
and if it is an exit waypoint, we pop the corresponding entrance waypoint from the 
stack and restore the previous permission set. After that, we emulate the original 
instruction if necessary, adjust the program counter to the location of the next in- 
struction and return from the exception handling. To simplify implementation, we 
insert 4 nops at the waypoint locations and change the first nop to a waypoint 
instruction (i.e. a bad instruction in our implementation). In this way, we can avoid 
emulating the original instructions, because nops perform no operations. 
3. Monitoring system requests 

We implemented the access monitor as an in-kernel system call interceptor in front 
of the system call dispatcher. In terms of access control logic, the subject is the 
application function; the object is the system call number; and the operation is the 
system call request. After trapping into the kernel for a system call, the access con- 
trol monitor first verifies whether the current process is being monitored or not. If 
yes, the monitor fetches the active waypoint from the top of the waypoint stack and 
its corresponding permission set from the permissions bitmap table. If the request 
belongs to the permission set, the monitor invokes the regular system service rou- 
tine; otherwise, the monitor refuses the system call request and writes the violation 
information in the kernel log. 
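
Steps 2 and 3 can be summarized in a short user-space simulation. The sketch below is an assumption-laden model rather than the kernel code: `Waypoint`, `Process`, `handle_exception`, and `check_syscall` are hypothetical names, addresses are taken from Figure 1, and a return value of False from `handle_exception` stands for passing control to the regular exception handler.

```python
from dataclasses import dataclass, field


@dataclass
class Waypoint:
    kind: str                              # "entrance", "middle", or "exit"
    permissions: frozenset = frozenset()
    max_passes: float = float("inf")       # 1 for the prologue and main()
    passes: int = 0


@dataclass
class Process:
    monitored: bool
    waypoints: dict                        # address -> Waypoint, registered at load time
    stack: list = field(default_factory=list)   # permission sets, innermost last
    log: list = field(default_factory=list)


def handle_exception(proc, address):
    """Step 2: validate the waypoint, then update the waypoint stack."""
    wp = proc.waypoints.get(address) if proc.monitored else None
    if wp is None or wp.passes >= wp.max_passes:
        return False                       # not a legitimate waypoint
    if wp.kind != "entrance" and not proc.stack:
        return False                       # nothing to update or pop
    wp.passes += 1
    if wp.kind == "entrance":
        proc.stack.append(wp.permissions)
    elif wp.kind == "middle":
        proc.stack[-1] = wp.permissions
    else:
        proc.stack.pop()                   # restores the previous permission set
    return True


def check_syscall(proc, name):
    """Step 3: allow the request only if it is in the active permission set."""
    if not proc.monitored:
        return True
    if proc.stack and name in proc.stack[-1]:
        return True
    proc.log.append(f"denied {name}")
    return False


proc = Process(monitored=True, waypoints={
    0x804C0FC: Waypoint("entrance", frozenset({"read", "write", "close"})),
    0x804C197: Waypoint("exit"),
})
handle_exception(proc, 0x804C0FC)
print(check_syscall(proc, "read"))      # True
print(check_syscall(proc, "execve"))    # False
print(proc.log)                         # ['denied execve']
```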

3.5 Implementation Issues 

We have considered the following issues in our implementation: 

1. monitoring offspring processes 

We monitor the offspring processes the same way as we monitor the parent process. 
A child process inherits the monitor flag, the permission bitmap table, the waypoint 
stack, and the stack pointer from the parent process. If the child is allowed to run 
another program (e.g. by calling execve()), then the waypoint data structures of 
the new program will replace the current ones. 

2. multiple-thread support 

Linux uses light-weight processes to support threads efficiently. Monitoring a light- 
weight process is similar to monitoring an ordinary process, but requires a sepa- 
rate waypoint stack for every thread. Our current implementation does not support 
thread-based access monitoring. 

3. number of passes 

By restricting the number of times a waypoint can be passed during a process life 
time, we can monitor some global control flow characteristics efficiently. In par- 
ticular, we allow the program prologue to start only one time, because it typically 
invokes dangerous system calls and is logically intended to run only once. We also 
allow main() to start only once per process execution. 

4. non-structured control flow 

Control flow does not always follow paths of function invocation. In the C/C++ 
languages, the goto statement performs an unconditional transfer of control to 
a named label, which must be in the current function. Because goto does not 
cross a function boundary, it does not affect function entrance and exit waypoints. 
However, it might jump across a middle waypoint, so we do not put any middle 
waypoints between a goto instruction and the corresponding target location. 

setjmp sets a jump point for a non-local goto, using a jmp_buf to store the 
current execution stack environment, while longjmp changes the control flow with 
the value in such a data structure. At the setjmp call, we use a waypoint to take a 
snapshot of the in-kernel waypoint stack and the jmp_buf, while at the longjmp 
location, a waypoint ensures that the target structure matches a jmp_buf in the 
kernel, and replaces the current waypoint stack with the corresponding snapshot. 
5. permissions switch with a low-overhead policy 

Under our low-overhead policy, we only monitor functions that invoke dangerous 
system calls. These functions may call one another, or call a function with only 
default permissions, or vice versa. Permissions are switched on the function call 
boundary. In the forward direction, where the caller has specific, elevated permis- 
sions, we use a middle waypoint to switch to the default permission set before 
calling, and switch the permissions back after returning. If the callee has specific, 
elevated permissions, regular entrance and exit waypoints will activate them. 

6. the raw system interface 

To control the target system, an attacker may use the raw system interface (e.g. 
/dev/mem and /dev/hda). Such access is anomalous for most applications. Our 
waypoint-based defense will restrict the opportunities for the attacker to call open, but 
further defenses, e.g. parameter checking, are necessary for complete defense. Our 
current implementation does not employ parameter checking. See [5, 25-27] for 
further information on parameter monitoring. 
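
Of the issues above, the setjmp/longjmp handling is the most algorithmic, and it can be sketched as a table of snapshots. The sketch assumes snapshots are keyed by the address of the jmp_buf and that the buffer contents are compared byte for byte; both choices, and the class name `JumpSnapshots`, are illustrative assumptions rather than details given in the text.

```python
class JumpSnapshots:
    """Toy model of in-kernel snapshots taken at setjmp waypoints."""

    def __init__(self):
        # jmp_buf address -> (copy of jmp_buf contents, copy of waypoint stack)
        self._snapshots = {}

    def on_setjmp(self, buf_address, buf_contents, waypoint_stack):
        """Record the jmp_buf and the waypoint stack as they are right now."""
        self._snapshots[buf_address] = (bytes(buf_contents), list(waypoint_stack))

    def on_longjmp(self, buf_address, buf_contents):
        """Return the waypoint stack to restore, or raise if the buffer is unknown or altered."""
        snapshot = self._snapshots.get(buf_address)
        if snapshot is None or snapshot[0] != bytes(buf_contents):
            raise ValueError("longjmp target does not match a registered jmp_buf")
        return list(snapshot[1])


snapshots = JumpSnapshots()
snapshots.on_setjmp(0xBFFFF100, b"\x01\x02\x03\x04", ["main", "parse"])

# Later, deeper in the call chain, a longjmp unwinds to the saved context.
restored = snapshots.on_longjmp(0xBFFFF100, b"\x01\x02\x03\x04")
print(restored)   # ['main', 'parse']
```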

3.6 Evasion Attacks and Defenses 

Because the waypoint structures and code are located in the kernel, attackers cannot 
manipulate them directly. However, an adaptive attack may create an illegal instruction 
in the data sections as a fake waypoint or jump to the middle of a legitimate instruction 
(in an x86 system) to trigger the waypoint activation mechanism. As we explained in 
section 3.2, our waypoint management code can recognize the fake waypoints because 
all the legitimate waypoints are loaded into the kernel at load time. If an attack 
intentionally jumps over a waypoint, it can change the control flow, but neither the waypoint 
stack nor the permission set is updated. 

Our waypoint mechanism was originally designed to counter attacks of category 1 
(shellcode based attacks) and category 2 (LCT attacks) described in section 2, because 
these attacks bypass waypoints and therefore fail to acquire the associated permissions. 
Evasion attacks may use the category 3 attack (function granularity attacks) if the program 
contains functions that invoke exactly the system calls required by the attacker, in the correct order. 
For such programs, the low-overhead policy may not supply sufficient trace information 
to support function call flow monitoring, so full monitoring of the function call path should 
be done. 

If an attacker launches a local mimicry attack, using one or a sequence of legitimate 
system calls of the current context, our mechanism cannot detect it. This is a general case 
of the abuse of the raw system interface mentioned above, and in similar fashion, we must 
employ complementary techniques. In our implementation, we adopt system interface 
randomization [2, 28] to counteract shellcode-based local mimicry attacks. 

Existing implementations of system call number randomization [2] use a 
permutation of the system call numbers. A simple permutation of the relatively small space 
(fewer than 256 system calls) allows attackers to guess the renumbering for a particular 
system call in 128 tries on average, or 255 guesses in the worst case. 
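
The figures quoted above can be checked with a short calculation, assuming the attacker tries each candidate number once and the secret is equally likely to sit at any position in the guessing order. The function names are introduced here for illustration.

```python
def average_guesses(candidates):
    """Mean position of the secret when all positions 1..candidates are equally likely."""
    return sum(range(1, candidates + 1)) / candidates


def worst_case_guesses(candidates):
    """After candidates - 1 failed guesses, the last value is known by elimination."""
    return candidates - 1


print(average_guesses(256))      # 128.5
print(worst_case_guesses(256))   # 255
```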

To survive this brute force attack, we use a substitution cipher to map from 8-bit 
system call numbers to 32-bit numbers, thereby making a brute-force attack on the 
system impractical. In Linux, a system call number n is an unsigned 8-bit integer between 
0 and 255, and is carried to the kernel in register %eax, a 32-bit register, of which 24 
bits are unused. In our implementation, we make use of the whole register to carry the 
32-bit system call number. We generate a one-to-one mapping between the 8-bit system 
call numbers and their corresponding 32-bit secrets. Upon a system call, the access monitor 
restores the original number from the secret. 
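
A one-to-one mapping of this kind can be sketched as a pair of lookup tables. The code below is an illustration only: it assumes the secrets are drawn uniformly at random with Python's `secrets` module, and the names `make_mapping` and `restore` are hypothetical; the text does not say how the actual mapping is generated.

```python
import secrets


def make_mapping(syscall_count=256):
    """Draw distinct random 32-bit secrets, one per 8-bit system call number."""
    used, ordered = set(), []
    while len(ordered) < syscall_count:
        candidate = secrets.randbits(32)
        if candidate not in used:        # keep the mapping one-to-one
            used.add(candidate)
            ordered.append(candidate)
    encode = dict(enumerate(ordered))    # original number -> 32-bit secret
    decode = {secret: number for number, secret in encode.items()}
    return encode, decode


def restore(decode, eax):
    """Recover the original system call number, or None for an unknown value."""
    return decode.get(eax)


encode, decode = make_mapping()
secret_for_execve = encode[11]           # 11 is execve on i386 Linux
print(restore(decode, secret_for_execve))   # 11
```

With only 256 valid values in a space of 2^32, a blind guess at a 32-bit number is almost always rejected, in contrast to the 256-value permutation discussed above.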

3.7 An Example 

To demonstrate the effectiveness of our waypoint mechanism, we attacked a real 
application program in Linux, using both shellcode and return-into-lib(c) attacks. We 
chose kon2 version 0.3.9b as the target. kon2 is a Kanji emulator for the console. 
It is a setuid root application program. In version 0.3.9b, there is a buffer overflow 
vulnerability in function ConfigCoding() when using the -Coding command line 
parameter. This vulnerability, if appropriately exploited, can lead to local users being 
able to gain root privileges [29]. Part of the source code of the vulnerable function 
ConfigCoding() is shown in figure 2(a), with the vulnerable statement highlighted. 
Figure 2(b) shows its original binary code, and figure 2(c) shows the binary code with 
waypoints added. 

To help the shellcode attack reach our waypoint mechanism, we disabled the system 
call renumbering and return address comparison features of our system during our ex- 
periment. In the following attack and defense experiment, we show how the waypoint 
mechanism can detect malicious system calls in both shellcode based and return-into- 
lib(c) based attacks. 

- Attack 1: calling a system call instruction located in the shellcode 

In the attack, the return address of function ConfigCoding() is overflowed. In 
this experiment, the faked return address redirects to a piece of shellcode. 
Without our protection, the attacking code generated a shell. With our mechanisms 
deployed, the malicious system request execve("/bin/sh") was caught and 
the shell was not generated. At the location of the ret instruction, an exit 
waypoint is triggered, and the permissions for ConfigCoding()’s parent function 
(ReadConfig()) are activated. Because execve() is not among the 
permissions of ReadConfig(), the system request is denied. It is interesting to see that 
if the return address is overwritten, the malicious request is issued in the context of 
the parent function, because the malicious request is issued after the execution of 
instruction ret and the exit waypoint. If our mechanisms are fully deployed, the 
exit waypoint will guarantee that the return address is not faked. 

- Attack 2: A low-level control transfer attack 

Recall that a low-level control transfer attack can redirect control to legitimate code 
for malicious purposes. In our experiment, we use the location of int execve(const 
char *filename, char *const argv[], char *const envp[]), a sensitive libc function, 
in the attacking code. Because neither ConfigCoding() nor its caller ReadConfig() 
has the permission to call system call execve(), the request is rejected by our monitor. 

Note that return-into-lib(c) attacks are difficult to detect. Program shepherding [25] 
ensures that library functions are called at only library entrance locations, and the 
static int ConfigCoding(const char *confstr) 
{ 
    char reg[3][MAX_COLS];    /* fixed-size buffer, MAX_COLS = 256 */ 
    int n, i; 

    *reg[0] = *reg[1] = *reg[2] = '\0'; 
    sscanf(confstr, "%s %s %s", reg[0], reg[1], reg[2]);    /* overflow vulnerability here */ 

    return SUCCESS; 
} 

(a) A buffer-overflow vulnerable function in kon2 



0804c0fc <ConfigCoding>: 
 804c0fc:   90       nop 
 804c0fd:   90       nop 
 804c0fe:   90       nop 
 804c0ff:   90       nop 
 804c100:   55       push   %ebp 
 804c101:   89 e5    mov    %esp,%ebp 
 ... 
 804c193:   31 c0    xor    %eax,%eax 
 804c195:   5f       pop    %edi 
 804c196:   c9       leave 
 804c197:   90       nop 
 804c198:   90       nop 
 804c199:   90       nop 
 804c19a:   90       nop 
 804c19b:   c3       ret 


(b) the original binary code 
 ... 
            fe       (bad)    <- exit waypoint 
 804c198:   90       nop 
 804c199:   90       nop 
 804c19a:   90       nop 
 804c19b:   c3       ret 





(c) the binary code with waypoints added 



Fig. 2. A buffer overflow vulnerable function in kon2 and its waypoints. 
library callee functions must exist in the external symbol table of the ELF format 
program. In kon2, because execl() and execlp() are used at other locations, 
there are corresponding entries in the external symbol table; so at any library 
entrance point, this request can pass the shepherding check. In addition, program 
shepherding monitors control flow only, so it is possible for an attack to compromise 
control flow related data (e.g., the GOT), making the return-into-lib(c) attack 
realistic. In an IDS without control flow information, because execve() is used in 
the program, a mimicry attack may pass the check. 

The only dangerous system call in the context of ConfigCoding() is open(). 
Within this context, the attacker does not have much freedom in gaining control of 
the system. Launching an execve() requires a global mimicry attack that crosses 
function boundaries, which is subject to both the call flow and permissions monitoring. 
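
The permission check behind both attacks can be illustrated with a minimal Python sketch. It is a user-level model only, not the kernel-trap implementation described in this paper. The WaypointMonitor class is hypothetical, and the permission sets are assumptions: the text states only that open() is the dangerous call available to ConfigCoding() and that neither function may call execve().

```python
# Hypothetical permission sets. The function names come from the text;
# the exact contents of each set are illustrative assumptions.
PERMISSIONS = {
    "ReadConfig": {"open", "read", "close"},
    "ConfigCoding": {"open"},
}


class WaypointMonitor:
    """Tracks the active function context and filters system calls."""

    def __init__(self, permissions, entry_function):
        self.permissions = permissions
        self.stack = [entry_function]

    def entry_waypoint(self, function):
        # Entering a function activates its permission set.
        self.stack.append(function)

    def exit_waypoint(self):
        # Leaving a function reactivates the caller's permission set.
        self.stack.pop()

    def allow(self, syscall):
        # A request is granted only if the active context permits it.
        return syscall in self.permissions[self.stack[-1]]


monitor = WaypointMonitor(PERMISSIONS, "ReadConfig")
monitor.entry_waypoint("ConfigCoding")
in_callee = monitor.allow("execve")   # Attack 2: request made inside ConfigCoding
monitor.exit_waypoint()               # ret triggers the exit waypoint
in_parent = monitor.allow("execve")   # Attack 1: shellcode runs in ReadConfig's context
```

In this model both requests are denied, because execve is absent from the permission set of whichever context is active when the request arrives.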

4 Overhead Measurement and Analysis 

We measured the overhead of the waypoint-based access monitor on an 800 MHz AMD 
Duron PC with 256 MB of memory running Red Hat Linux 9.0 (kernel version 2.4.20-8). 

The overhead of the waypoint-based access monitor has two main causes: waypoint 
registration in the exception handler and running the access monitor at each system 
call. The system call mapping is done before execution, so it does not introduce any 
runtime overhead. The remapping at each system call is a binary search on a 256-entry 
table in our implementation. Because the remapping takes only tens of instructions, this 
overhead is negligible. The access monitor at the system call invocation compares the 
incoming request number with the permission bitmap. These comparison operations cost 
little time. Therefore, the majority of the overhead comes from the additional trap for 
the waypoint registration code in the exception handler, where caches and pipelines 
will be flushed. 
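
The two cheap per-call operations described above, the binary-search remapping and the bitmap comparison, can be sketched as follows. This is an illustration under stated assumptions rather than the kernel code: the table contents, the syscall numbers, and the names remap and permitted are all hypothetical, and the table is shortened from 256 entries to four.

```python
from bisect import bisect_left

# Hypothetical renumbering table: sorted (renumbered, original) pairs.
# The text describes 256 entries; a handful suffice for illustration.
REMAP = [(17, 5), (42, 11), (99, 3), (180, 4)]
KEYS = [renumbered for renumbered, _ in REMAP]


def remap(renumbered):
    """Binary-search the table; return the original number or None."""
    i = bisect_left(KEYS, renumbered)
    if i < len(KEYS) and KEYS[i] == renumbered:
        return REMAP[i][1]
    return None


def permitted(bitmap, syscall_number):
    """Test the bit for syscall_number in an integer permission bitmap."""
    return (bitmap >> syscall_number) & 1 == 1


# Permission bitmap allowing only syscalls 3 and 5 (arbitrary numbers here).
bitmap = (1 << 3) | (1 << 5)
```

A lookup over 256 sorted entries needs at most about eight comparisons and the permission test is a shift and a mask, which is consistent with the claim that neither step dominates the overhead.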

Our measurement on a micro-benchmark program that calls a monitored function in 
a tight loop shows that the overhead for one waypoint invocation is 0.395 microseconds 
on average. This captures the cost of exception handling, but does not reveal overhead 
due to cache and pipeline flushing. 

To better understand these effects on real applications, we tested a few well-known 
GNU applications. We did not use real time, the time between program start and end, 
because the overhead can be hidden by the overwhelming I/O time. Instead, we use user 
time and sys time, the times that the process spends running in user mode and kernel 
mode, respectively. These times give us an accurate understanding of the overhead. 
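
The distinction between user time and sys time can be reproduced with the Python standard library. The sketch below is only an illustration of the measurement idea, not the benchmark harness used for figure 3; the helper cpu_times and the workload busy_loop are hypothetical.

```python
import os


def cpu_times(work):
    """Run work() and return (user_seconds, system_seconds) it consumed."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system


def busy_loop():
    # CPU-bound work that stays mostly in user mode.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


user_s, sys_s = cpu_times(busy_loop)
```

Because os.times() reports CPU time charged to the process rather than wall-clock time, time spent blocked on I/O does not appear in either figure, which is the reason the text prefers these counters over real time.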

As shown in figure 3, when we monitor all functions, the user time increases by 
about 10%-20%, but the system time increases dramatically. We attribute the increase 
in user time to the flushing of caches and pipelines. In the GNU programs, we measured 
a slowdown of 3-5 times due to our waypoint mechanism. 

When we monitor only dangerous functions, the overhead is smaller than for moni- 
toring all functions. Dangerous-function monitoring for enscript, gzip and 
gunzip introduces small overhead, but the overhead for tar is still high. In gunzip, 
there are only a few function calls for checking the zip file and for decompressing it. 
Fig. 3. Overhead of the waypoint-based access monitor. For every group, the left bar shows 
the running time of the original program; the middle bar shows its running time under 
waypoint-based access monitoring for all functions; and the right bar shows the result 
with monitoring only functions that invoke dangerous system calls. 

Because there are only a few waypoint invocations in the entire program execution, the 
running time is close to the original running time. We conclude that the overhead 
depends not only on how many functions are monitored, but also on how frequently 
these functions are invoked. 

5 Related Work 

There are three layers of defense in preventing attacks from subverting the system. The 
first layer of defense is to prevent the malicious data and code from being injected, 
typically by avoiding and tolerating implementation errors in the system. Existing tech- 
niques include language-based or compiler-based techniques, such as type checking [9, 
30-32], or protecting data pointers [33] and format strings [3]. The second layer of de- 
fense is to prevent malicious code from being executed. Prevention methods include 
instruction set randomization [34,35], non-executable stack and heap pages [8,10], 
process image randomization [10, 13], and stack integrity guarding [4, 11]. The third 
layer of defense attempts to prevent the executing attack code from doing further harm 
through the system interface. Existing work at this stage includes anomaly detection [5, 
6, 12, 24, 25, 27], process randomization [2, 10, 13, 28, 36], and instruction set random- 
ization [34, 35]. 
Realizing that the lack of context information in detection makes certain false 
negatives possible (e.g., the impossible-path problem and mimicry attacks), some 
anomaly monitors apply partial context information in anomaly detection [5,24,25]. 
The benefit of using context information is that control path information between two 
system call invocations can help detect anomalies. 

Retrieving user call stack information in a system call interceptor [5] is a promising 
way to bring function call flow information to the anomaly monitor. We explore this 
approach further by providing trustworthy control flow information to the monitor. 
Another difference is that while [5] emphasizes the call stack signature at a system 
call invocation, we concentrate on guarding with the permissions of application functions. 
Program shepherding [25] uses an interpreter to monitor the control flow of a process. It 
forces application code to call library functions only through certain library entrance 
points, and the target library function must be one of the external functions listed in the 
external symbol table of the application executable. Because program shepherding does 
not monitor the data flow, some control flow information, such as function pointers, may 
be overwritten. If the overwritten pointer happens to be a library entry point, and the 
attack chooses a library function that is used at any other location in the program, the 
attack can pass the check. Context-related permissions can help in this situation. The 
approach in [24] associates a system call with its invocation address. The 
return-into-lib(c) attack calls a library function, rather than a piece of shellcode. 
In this case, the locations do not provide enough control path information to the detector. 

6 Conclusion 

In this paper, we propose a new mechanism - waypoints - to provide trustworthy control 
flow information for anomaly monitoring. We demonstrated how to use our waypoint 
mechanism to detect global mimicry attacks. Our approach can also catch return-into- 
others impossible paths by guarding the return addresses. Implementing waypoints by 
kernel traps provides reliable control path information, but slows down an ordinary 
program by 3-5 times. As a trade-off, by monitoring only dangerous system calls, we 
can reduce the overhead by 16%-70%, but no longer monitor the complete function call 
path. 

As noted in our discussion of access to the raw system interface, waypoint-based 
detection cannot find local mimicry attacks, because the function has the proper permis- 
sions required to invoke the dangerous system calls. In our current implementation, we 
associate a permission set with each waypoint, but a state machine can provide tighter 
monitoring than a set. We will also investigate the use of complementary techniques, 
such as parameter checking, to extend waypoints to defend against local mimicry at- 
tacks. 

Impossible paths may be generated at multiple granularities. Our waypoint mecha- 
nism can detect only function-granular return-into-others impossible paths by guarding 
return addresses. 

Our waypoint mechanism cannot directly detect attacks through interpreted code. 
Because we work at the binary code level, our mechanism does not “see” the interpreted 
code. Rather, it monitors the interpreter itself, and so only sees actions taken by the 
interpreter in response to directives in the interpreted code. 
So far, we generate waypoints and their permissions statically, which does not 
support self-loading code. Our future work will be to support self-loading code by moving 
the waypoint setup procedure (by code instrumentation) to program load time. Additional 
future work is to optimize performance. Some optimizations that we have discussed are 
hoisting waypoints out of loops and merging waypoints for several consecutively called 
functions. 

Our prototype implementation of the waypoint mechanism for Linux x86 systems 
may be downloaded from http://www.sai.syr.edu/projects. 
Abstract. Worm detection systems have traditionally used global strategies and 
focused on scan rates. The noise associated with this approach requires statisti- 
cal techniques and large data sets (e.g., monitored machines) to yield timely 
alerts and avoid false positives. Worm detection techniques for smaller local net- 
works have not been fully explored. 

We consider how local networks can provide early detection and complement 
global monitoring strategies. We describe HoneyStat, which uses modified hon- 
eypots to generate a highly accurate alert stream with low false positive rates. Un- 
like traditional highly-interactive honeypots, HoneyStat nodes are script-driven, 
automated, and cover a large IP space. 

The HoneyStat nodes generate three classes of alerts: memory alerts (based on 
buffer overflow detection and process management), disk write alerts (such as 
writes to registry keys and critical files) and network alerts. Data collection is au- 
tomated, and once an alert is issued, a time segment of previous traffic to the node 
is analyzed. A logit analysis determines what previous network activity explains 
the current honeypot alert. The result can indicate whether an automated or worm 
attack is present. 

We demonstrate HoneyStat’s improvements over previous worm detection tech- 
niques. First, using trace files from worm attacks on small networks, we demon- 
strate how it detects zero day worms. Second, we show how it detects multi vector 
worms that use combinations of ports to attack. Third, the alerts from HoneyStat 
provide more information than traditional IDS alerts, such as binary signatures, 
attack vectors, and attack rates. We also use extensive (year long) trace files to 
show how the logit analysis produces very low false positive rates. 

Keywords: Honeypots, Intrusion Detection, Alert Correlation, Worm Detection 



1 Introduction 

Worm detection strategies have traditionally relied on artifacts incidental to the worm 
infection. For example, many researchers measure incoming scan rates (often using 
darknets) to indirectly detect worm outbreaks, e.g., [ZGGT03]. But since these tech- 
niques measure noise as well as attacks, they often use costly algorithms to identify 
worms. For example, [ZGGT03] suggests using a Kalman filter [Kal60] to detect worm 
attacks. In [QDG+], this approach was found to work with a large data set but proved 
inappropriate for smaller networks. 
To improve detection time and decrease errors caused by noise, the solution so far 
has been to increase monitoring efforts, and gather more data. The intuition is that 
with more data, statistical models perform better. Thus, researchers have suggested the 
creation of global monitoring centers [MSVS03], and collecting information from dis- 
tributed sensors. These efforts are already yielding interesting results [YBJ04,Par04]. 

Although the need for global monitoring is obvious, the value this has for local 
networks is not entirely clear. For example, some local networks might have enough 
information to conclude a worm is active, based on additional information they are un- 
willing to share with other monitoring sites. Likewise, since global detection strategies 
require large amounts of sensor data, worm outbreaks may be detected only after local 
networks fall victim. Also, we see significant problems in gaining consensus among dif- 
ferent networks, which frequently have competing and inconsistent policies regarding 
privacy, notification, and information sharing. Without doubt, aggregating information 
from distributed sensors makes good sense. However, our emphasis is on local 
networks and requires a complementary approach. In addition to improving the quantity of 
monitoring data, researchers should work to improve the quality of the alert stream. 

In this paper, we propose the use of honeypots to improve the accuracy of alerts 
generated for local intrusion detection systems. To motivate the discussion, we describe 
in Section 3 the worm infection cycle we observed in honeypots that led to the cre- 
ation of HoneyStat. Since honeypots usually require labor-intensive management and 
review, we describe in Section 4 a deployment mechanism used to automate data col- 
lection. HoneyStat nodes collect three types of events: memory, disk write and network 
events. Section 4 describes these in detail, and discusses a way to compare and correlate 
intrusion events. Using logistic regression, we analyze previous network traffic to the 
honeypot to see what network traffic most explains the intrusion events. Intuitively, the 
logit analysis asks if there is a common set of network inputs that precede honeypot 
intrusions. Finding a pattern suggests the presence of an automated attack or worm. 
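
The intuition behind the logit analysis can be illustrated with a small, self-contained Python sketch. It is a simplified illustration, not the formulation used later in this paper: the feature encoding (one indicator per port seen in a time window), the toy data, and the function fit_logit are all assumptions, and the fit uses plain gradient ascent instead of a statistics package.

```python
import math


def fit_logit(rows, labels, steps=2000, rate=0.1):
    """Fit logistic regression weights (bias first) by gradient ascent."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(steps):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = y - p
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w + rate * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Each row: [saw_port_135, saw_port_80] in one time window (toy data).
rows = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [1, 1], [0, 0]]
# Label 1 means the window preceded a honeypot event.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
bias, w135, w80 = fit_logit(rows, labels)
```

In this toy data, port 135 traffic precedes every event while port 80 traffic is unrelated to the outcome, so the fitted coefficient for port 135 comes out much larger in magnitude than the one for port 80. A large coefficient shared across many observations is the kind of pattern that suggests an automated attack.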

To demonstrate HoneyStat’s effectiveness, in Section 6, we describe our experience 
deploying HoneyStat nodes, and a retrospective analysis of network captures. We also 
use lengthy (year long) network trace files to analyze the false positive rate associated 
with the algorithm. The false positive rate is low, due to two influences: (a) the use 
of honeypots, which only produce alerts when there are successful attacks, and (b) the 
use of user-selected confidence intervals, which let one define a threshold for alerts. 
Finally, in Section 7, we analyze whether a local detection strategy with a low false 
positive rate (like HoneyStat) can make an effective worm detection tool. We consider 
the advantages this approach has for local networks. 



2 Related Work 

Honeypots. A honeypot is a vulnerable network decoy used for several purposes: (a) 
distracting attackers, (b) gathering early warnings about new attack techniques, (c) fa- 
cilitating in-depth analysis of an adversary’s strategies [Spi03,Sko02]. By design, a hon- 
eypot should not receive any network traffic, nor will it run any legitimate production 
services. This greatly reduces the problem of false positives and false negatives often 
found in other types of IDS systems. 
Traditionally, honeypots have been used to gather intelligence about how human 
attackers operate [Spi03]. The labor-intensive log review required of traditional 
honeypots makes them unsuitable for a real-time IDS. In our experience, data capture and 
log analysis time can require a 1:40 ratio, meaning that a single hour of activity can 
require a week to fully decipher [LLO+03]. 

The closest work to our own is [LLO+03], which uses honeypots in an intrusion 
detection system. We have had great success at the Georgia Institute of Technology 
utilizing a Honeynet as an IDS tool, and have identified a large number of compro- 
mised systems on campus, mostly the result of worm-type exploits. Other researchers 
have started to look at honeypot alert aggregation techniques [JX04], but presume a 
centralized honeypot farming facility. Our simplified alert model allows for distributed 
honeypots, but is more focused on defending local networks. 

Researchers have also considered using virtual honeypots, particularly with honeyd 
[Pro03]. Initially used to help prevent OS fingerprinting, honeyd is a network daemon 
that exhibits the TCP/IP stack behavior of different operating systems. It has since been 
extended to emulate some services (e.g., NetBIOS). Conceptually, honeyd is a daemon 
written using libpcap and libdnet. To emulate a service, honeyd requires researchers 
to write programs that completely copy the service’s network behavior. Assuming one 
can write enough modules to emulate all aspects of an OS, a single machine can delay 
worms by inducing needless connections. 

Recently, honeyd was offered as a way to detect and disable worms [Pro03]. We 
believe this approach has promise, but must overcome a significant hurdle before it is 
used as an early warning IDS. It is not clear how a daemon emulating a network service 
can catch zero day worms. If one knows a worm’s attack pattern, it is possible to write 
modules that will behave like a vulnerable service. But before this is known, catching 
zero day worms requires emulating even the presumably unknown bugs in a network 
service. We were unable to find solutions to these limitations, and so do not consider 
virtual networks as a means of providing an improved alert stream. Instead, we used 
full honeypots. 

More closely related to our work, [Kre03] suggested automatic binary signature 
extraction using honeypots. This work used honeyd, flow reconstruction, and pattern 
detection to generate IDS rules. The honeycomb approach also has promise, but uses 
a very simple algorithm (longest common substring) to correlate payloads. This makes 
it difficult to identify polymorphic worms and worms that use multiple attack vectors. 
Honeycomb was well suited for its stated purpose, however: extracting string signatures 
for automated updates to a firewall. 

Worm Detection. Worm propagation and early detection have been active research top- 
ics in the security community. In worm propagation, researchers have proposed an epi- 
demic model to study worm spreading, e.g., [Sta01,ZGT02,CGK03,WL03]. For early 
detection, researchers have proposed statistical models, e.g., Kalman Filter [ZGGT03], 
analyzing repeated outgoing connections [Wil02], ICMP messages collected at border 
routers to infer worm activity [BGB03] and victim counter-based detection algorithms 
[WVGK04]. All these approaches require a large deployment of sensors or a large mon- 
itoring IP space (e.g., IP addresses). Others suggest a “cyber Center for Disease 
Control” to coordinate data collection and analysis [SPN02]. Researchers have also 
proposed various data collection and monitoring architectures, e.g., “network telescopes” 
[Moo02b] and an “Internet Storm Center” [Ins]. 

Our objective is also to conduct early worm detection. However, considering the 
current difficulties and challenges in large-space monitoring systems (e.g., conflicts in 
data sharing, privacy, and coordinated responses), our detection mechanism is based 
on local networks, in particular, local honeypots for worm detection. In our prior work 
[QDG+], we analyzed the current worm early detection algorithms, i.e., [ZGGT03] and 
[WVGK04], and found instability and high false positives when applying these 
techniques to local monitoring networks. 

Event Correlation. Several techniques have been proposed for alert/event correlation, 
e.g., pre-/post-condition-based pattern matching [NCR02,CM02,CLF03], chronicles 
formalism [MMDD02], clustering techniques [DW01], and probabilistic correlation 
techniques [VS01,GHH01]. All these techniques count on some prior knowledge 
of attack step relationships. Our approach is different in its need to detect zero-day 
worm attacks, and does not depend on prior knowledge of attack steps. Statistical alert 
correlation was presented in [QL03]. Our work is different in that our correlation anal- 
ysis is based on variables collected over short observations. Time series-based analysis 
proposed in [QL03] is good for relatively long observation variables and requires a 
series of statistical tests in order to accurately correlate observations. 



3 Worm Infection Cycles 

If local networks do not have access to the volume of data used by global monitoring 
systems, what local resources can they use instead? Studying worm infections gives 
some insights, and identifies what data can be collected for use in a local IDS. 

Model of Infection. A key assumption in our monitoring system is that the worm in- 
fection can be described in a systematic way. We first note that worms may take three 
types of actions during an infection phase. The Blaster worm is instructive, but we do 
not limit our model to this example. Blaster consists of a series of modules designed to 
infect a host [LUR03]. 

Memory Events. The infection process, illustrated in Figure 1(a), begins with a probe 
for a victim providing port 135 RPC services. The service is overflowed, and the victim 
spawns a shell listening on a port, usually 4444. (Later generations of the worm use 
different or even random ports.) This portion of the infection phase is characterized by 
memory events. No disk writes have taken place, and network activity cannot (yet) be 
characterized as abnormal, since the honeypot merely ACKs incoming packets. Still, a 
buffer overflow has taken place, and the infection has begun by corrupting a process. 

Network Events. The Blaster shell remains open for only one connection and closes 
after the infection is completed. The shell is used to instruct the victim to download 
(often via tftp) an “egg” program. The egg can be obtained from the attacker, or a third 
party (such as a free website, or other compromised hosts.) The time delay between the 
initial exploit and the download of the egg is usually small in Blaster, but this may not 
always be the case. Exploits that wait a long period to download the egg risk having the 
service restarted, canceled, or infected by competing worms (e.g., Nachi). Nonetheless, 
some delay may occur between the overflow and the “egg” transfer. During all of this 
time, other harmless network traffic may arrive. This portion of the infection phase is 
often characterized by network traffic. Downloading the egg, for example, requires the 
honeypot to initiate TCP (SYN) or UDP traffic. In some cases, however, the entire worm 
payload can be included in the initial attack packet [Sta01,Moo02a,ZGT02]. In such a 
case, network events may not be seen until much later. 

Disk Events. Once the Blaster egg is downloaded, it is written to a directory so it may 
be activated upon reboot. Some worms (e.g., Witty [LUR04]) do not store the payload 
to disk, but do have other destructive disk operations. Not every worm creates disk 
operations. 

These general categories of events, although present in Blaster, do not limit our anal- 
ysis to just the August 2003 DCOM worm. A local detection strategy must anticipate 
future worms lacking some of these events. 

Improved Data Capture. Traditional worm detection models deal with worm infection 
at either the start or end of the cycle shown in Figure 1(a). For example, models based on 
darknets consider only the rate and sometimes the origin of incoming scans, the traffic 
at the top of the diagram. The Destination Source Correlation (DSC) model [GSQ+04] 
also considers scans, but also tracks outgoing probes from the victim (traffic from the 
bottom of the diagram). The activity in the middle of the cycle (including memory and 
disk events) can be tracked. 

Even if no buffer overflow is involved, as in the case of mail-based worms and 
LANMAN weak password guessing worms (e.g., pubstro worms), the infection still 
follows a general pattern: a small set of attack packets obtain initial results, and further 
network traffic follows, either from the egg deployment, or from subsequent scans. 

Intrusion detection based only on incoming scan rates must address the potentially 
high rate of noise associated with darknets. As noted in Figure 1(a), every phase of the 
infection cycle may experience non-malicious network traffic. Statistical models that 
filter the noise (e.g., Kalman) require large data sets for input. It is no wonder, then, 
that scan-based worm detection algorithms have recently focused on distributed data 
collection. 

4 HoneyStat Configuration and Deployment 

The foregoing analysis of the worm infection cycle generally identified three classes 
of events that one might track in an IDS: memory, disk and network events. As noted 
above in Section 2, it is difficult to track all of these events in virtual honeypots or 
even in stateful firewalls. Networks focused on darknets, of course, have little chance 
of getting even complete network events, since they generally blackhole SYN packets, 
and never see the full TCP payload. 

A complete system is needed to gather the worm cycle events and improve the data 
stream for an IDS. We therefore use a HoneyStat node, a minimal honeypot created 
in an emulator, and multihomed to cover a large address space. The deployment typ- 
ically would not be interesting to attackers, because of its minimal resources (limited 
memory, limited drive size, etc.). Worms, however, are indiscriminate and will attack 
this configuration. 
In practice, we can use VMware GSX Server as our honeypot platform. Currently, 
VMware GSX Server V3 can support up to 64 isolated virtual machines on a single 
hardware system [VMW04]. Mainstream operating systems (e.g., Windows, Linux, 
FreeBSD) all support multihoming. For example, Windows NT allows up to 32 IP 
addresses per interface. So if we use a GSX server with 64 virtual machines running 
Windows, each with 32 IP addresses, then a single GSX machine can 
have 64 * 32 = 2^11 IP addresses. 

In practice, we found nodes with as little as 32MB RAM and 770MB virtual drives 
were more than adequate for capturing worms. Since the emulators were idle for the 
vast majority of time, many instances could be started on a single machine. Although 
slow and unusable from a user perspective, these virtual honeypots were able to respond 
to worms before any timeouts occurred. 

The honeypots remain idle until a HoneyStat event occurs. We define three types of 
events, corresponding to the worm infection cycle discussed in Section 3. 

1. MemoryEvent. A honeypot can be configured to run buffer overflow protection 
software, such as StackGuard [Inc03], or similar process-based monitoring tools. 
Likewise, Windows logs can be monitored for process failures and crashes. Any 
alert from these tools constitutes a HoneyStat event. Because there are no users, 
we found that one can use very simple anomaly detection techniques that would 
otherwise trigger enormous false positive rates on live systems. 

2. NetworkEvents. The honeypots are configured to generate no outgoing traffic. 
If a honeypot generates SYN or UDP traffic, we consider it an event. 

3. DiskEvents. Within the limits of the host system, we can also monitor honeypot 
disk activities and trap writes to key file areas. For example, writes to system logs 
are expected, while writes to C:\WINNT\SYSTEM32 are clearly events. In practice, 
we found that kqueue [Lem01] monitoring of flat virtual disks was reasonably 
efficient. One has to enumerate all directories and files of interest, however. 

Data recorded during a HoneyStat event includes: (a) the OS/patch level of the 
host; (b) the type of event (memory, net, disk) and relevant capture data: for memory 
events, the stack state or any core; for network events, the outgoing packet; and for 
disk events, a delta of the file changes, up to a size limit; and (c) a trace file of all 
prior network activity, within a bound tp, discussed in Section 5. 
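
The recorded fields listed above map naturally onto a small record type. The following Python sketch is one possible layout, not the authors' data format: the names HoneyStatEvent, EventType, MAX_DELTA_BYTES and make_disk_event are hypothetical, and the 4096-byte limit is an arbitrary stand-in for the unspecified size limit.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class EventType(Enum):
    MEMORY = "memory"
    NETWORK = "net"
    DISK = "disk"


@dataclass
class HoneyStatEvent:
    """One alert record; field names are illustrative assumptions."""
    os_patch_level: str        # (a) OS/patch level of the host
    event_type: EventType      # (b) memory, net, or disk
    capture: bytes             # (b) stack state, outgoing packet, or file delta
    prior_traffic: List[bytes] = field(default_factory=list)  # (c) trace within tp


MAX_DELTA_BYTES = 4096  # assumed size limit for disk deltas


def make_disk_event(os_patch_level, delta, prior_traffic):
    """Build a disk event, truncating the file delta to the size limit."""
    return HoneyStatEvent(os_patch_level, EventType.DISK,
                          delta[:MAX_DELTA_BYTES], list(prior_traffic))
```

Keeping the prior traffic inside the record means the analysis node can work from the event alone, without querying the honeypot again after it has been reset.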

Based on our analysis in Section 3, we believe this to be a complete set of features 
necessary to observe worm behavior. However, new worms and evasive technologies 
will require us to revisit this heuristic list of features. Additionally, if HoneyStat is 
given a larger mission (e.g., e-mail virus detection or trojan analysis instead of just 
worm detection), then more detailed features must be extracted from the honeypots. 

Once events are recorded, they are forwarded to an analysis node. This may be on 
the same machine hosting the honeypots, or (more likely) a central server that performs 
logging and propagates the events to other interested nodes. Figure 1(b) shows a con- 
ceptual view of one possible HoneyStat deployment. In general, the analysis node has a 
secure channel connecting it with the HoneyStat servers. Its primary job is to correlate 
alert events, perform statistical analysis, and issue alerts. 
Fig. 1. a) A time line of a Blaster worm attack. Because of modular worm architectures, victims 
are first overflowed with a simple RPC exploit, and instructed to obtain a separate worm “egg”, 
which contains the full worm. The network activity between the initial overflow and download 
of the “egg” constitutes a single observation. Multiple observations allow one to filter out other 
scans arriving at the same time. b) HoneyStat nodes interact with malware on the Internet. Alerts 
are forwarded through a secure channel to an analysis node for correlation. 
Fig. 2. a) In the top diagram, a HoneyStat MemEvent occurs, and the honeypot is allowed to 
continue, in hopes of capturing an egg or payload. (The event is analyzed immediately, 
however.) If a subsequent NetEvent occurs, we update the previous MemoryEvent 
and reset. In the bottom diagram, we see a NetEvent without any prior MemoryEvent, 
indicating our host-based IDS did not spot any anomaly. We immediately reset, and analyze the 
circled traffic segment. b) Aggregating multiple honeypot events can help spot a pattern. In all 
three diagrams, activity Pa is observed, even though others (Pb, Pc) appear closer in time to the 
event. 



HoneyStat events are placed in a work queue and analyzed by worker threads. We
can prioritize certain types of events, such as when outgoing connections attempt to
reach many different IPs (i.e., the activity looks like worm scanning). These events are
more important than others, since they suggest an automated spread mechanism instead
of a single connection back to the attacker’s IP. This idea borrows from the work of
researchers who observed the importance of tracking distinct destination IPs in worm
detection systems [JPBB04,WVGK04]. Future work will explore queue processing and
other prioritizations.
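
The prioritization described above can be sketched as a small priority queue. This is only an illustration, not the HoneyStat implementation: the `HoneypotEvent` record, its field names, and the choice to rank purely by the number of distinct destination IPs are all assumptions made for the example.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass
class HoneypotEvent:
    """Hypothetical event record; field names are illustrative only."""
    honeypot_id: str
    kind: str                                   # "memory", "network", or "disk"
    dest_ips: set = field(default_factory=set)  # distinct outgoing destinations


class EventQueue:
    """Work queue that serves events contacting more distinct IPs first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: FIFO among equal priorities

    def put(self, event):
        # heapq is a min-heap, so negate the distinct-destination count.
        priority = -len(event.dest_ips)
        heapq.heappush(self._heap, (priority, next(self._order), event))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

A worker thread would call `get()` in a loop; an event that reached three distinct IPs is analyzed before one that reached a single IP, which in turn precedes a memory-only event.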

Several actions are taken when a HoneyStat event is analyzed. 

1. First, we check if the event corresponds to a honeypot that has already been 
recorded as “awake” or active. If the event is a continuation of an ongoing in- 
fection, we simply annotate the previous event with the current event type. For 
example, if we first witness a MemoryEvent, and then see a DiskEvent for the 
same honeypot, we update the MemoryEvent to include additional information, 
such as the DiskEvent and all subsequent network activity. The intuition
here is that MemoryEvents are usually followed by something interesting, and it 
is worth keeping the honeypot active to track this. 

2. Second, if the event involved NetworkEvents (e.g., either downloading an egg 
or initiating outgoing scans), the honeypot reporting the event is reset. The idea here 
is two-fold. In keeping with the principle of honeypot Data Control [LLO+03],
we need to prevent the node from attacking other machines. Also, once network 
activity is initiated, we have enough attack behavior recorded to infer that the worm 
is now infective. If only DiskEvents or MemoryEvents are observed, the node 
is not reset. 

Since the honeypot is deployed using an emulator, resets are fast in practice. One 
merely has to kill and restart the emulator, using in round-robin style a fresh copy 
of the virtual disk. The disk image is kept in a suspended state, and no reboot of 
the guest OS is required. The reset delay is slight, often seconds or a minute, and 
always in practice completes before TCP timeouts occur. The effect of startup time 
on detection is considered in the analysis in Section 6. 

3. Third, the analysis node examines basic properties of the event, and determines 
whether it needs to redeploy other honeypots to match the affected OS. The intu- 
ition here is that HoneyStat nodes are often deployed to cover a variety of operating 
systems, such as Linux and Windows at different patch levels. If one of the systems
falls victim to a worm, it makes sense to redeploy most of the other nodes to run 
the vulnerable OS. This improves the probability that an array of HoneyStat nodes 
will capture similar events. Again, the delay this causes for detection is discussed 
in Section 7. 

4. Finally, the HoneyStat event is correlated with other observed events. If a pattern 
emerges, this can indicate the presence of a worm or other automated attacks. Any 
reasonable correlation of events can be done. In the next section, we present a 
candidate analysis based on logistic regression. 
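
The four steps above can be summarized in a short sketch. It is a simplified illustration under stated assumptions: the `Event` and `AnalysisNode` names are hypothetical, resets and redeployments are only recorded as intentions rather than performed, and correlation is deferred to a list of pending events.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    honeypot_id: str
    kind: str      # "memory", "disk", or "network"
    os_name: str   # OS image the honeypot was running (illustrative field)


@dataclass
class AnalysisNode:
    """Hypothetical analysis node; reset and redeploy are stubs that record intent."""
    active: dict = field(default_factory=dict)      # honeypot_id -> event kinds so far
    resets: list = field(default_factory=list)      # honeypots scheduled for reset
    redeploy_to: list = field(default_factory=list) # OS images other nodes should match
    pending: list = field(default_factory=list)     # events awaiting correlation

    def handle(self, ev):
        # Step 1: annotate an ongoing infection instead of opening a new record.
        first_seen = ev.honeypot_id not in self.active
        self.active.setdefault(ev.honeypot_id, []).append(ev.kind)

        # Step 2: network activity triggers a reset so the node cannot attack
        # other machines; memory-only or disk-only activity leaves it running.
        if ev.kind == "network":
            self.resets.append(ev.honeypot_id)
            del self.active[ev.honeypot_id]

        # Step 3: on the first event from a honeypot, note the affected OS so
        # other honeypots can be redeployed to match it.
        if first_seen:
            self.redeploy_to.append(ev.os_name)

        # Step 4: keep the event for correlation with other honeypots' events.
        self.pending.append(ev)
```

In this sketch a MemoryEvent followed by a DiskEvent keeps the honeypot marked active, and only a later network event schedules the reset.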

As an example, in Figure 2(b), we see three different honeypots generating events.
Prior input to the honeypots includes a variety of sources. For simplicity, the example
in Figure 2(b) merely has three different active ports, Pa, Pb, Pc. Intuitively, we can
use the time difference between the honeypot event and the individual port activity to
infer what caused the honeypot to become active. But if all these events are from the
same worm, then one would expect to see the same inputs to all three honeypots. In
this case, only Pa is common to all three. A logistic regression presents a more flexible
way of discovering the intersection of all inputs and provides a better explanation why
a honeypot has become active.
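
The naive version of this idea, before any regression is applied, is a plain set intersection over the ports seen prior to each honeypot event. The sketch below shows only that baseline; the function name and the input layout (honeypot identifier mapped to a list of port labels) are assumptions for illustration.

```python
def common_ports(observations):
    """Return the ports observed before every honeypot event.

    observations maps a honeypot identifier to the ports that saw traffic
    shortly before that honeypot became active.
    """
    port_sets = [set(ports) for ports in observations.values()]
    if not port_sets:
        return set()
    return set.intersection(*port_sets)


# The situation of Figure 2(b): only Pa precedes all three events.
figure_2b = {"hp1": ["Pa", "Pb"], "hp2": ["Pc", "Pa"], "hp3": ["Pa"]}
shared = common_ports(figure_2b)
```

A strict intersection discards a port as soon as one observation misses it, which is why the text prefers logistic regression as the more flexible tool.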

4.1 Honeypot Evasion 

As more honeypots are used in IDS settings, attackers may attempt to evade by having 
worms detect and avoid honeypot traps. Honeypot researchers have observed that a few 
assembly instructions behave slightly differently in various (often incomplete) emula- 
tors, and that emulated hardware may have predictable signatures (e.g., BIOS Strings, 
MAC address ranges for network cards) [Cor04,Sei02]. One can prevent trivial honey- 
pot detection by patching emulated VMs [Kor04], and removing any obvious indicators 
like registry keys. 

Hand-crafted assembly instructions designed to detect VMWare present a different
problem. Since Intel chips don’t support multiple zero ring contexts, some instructions 
will elicit a VM monitor error, allowing attackers to evade the honeypot trap. This can 
be countered by filtering incoming traffic to identify the limited instruction set designed 
to detect VMWare. Failing this (e.g., if the emulator detection code is polymorphic), one 
can always just treat the crashed emulator as a HoneyStat memory event. This yields a 
more limited alert (e.g., you miss disk events), but allows HoneyStat to correlate what 
caused the error. 

Attackers might also attempt to make machine observations, e.g., the time needed 
to perform lengthy calculations. This potentially benefits defenders, since worms may 
have a slower propagation rate, allowing for human intervention and earlier detection. 
Ultimately, we believe the honeypot evasion problem may devolve into a classic cat- 
and-mouse game, not unlike virus detection. In this case, however, the tables are turned, 
and it is the attacker who must perform reliable detection in a changing environment. 



5 Logistic Analysis of HoneyStat Events 

Our key objective is the detection of zero-day worms, or those without a known signature.
Without the ability to perform pattern matching, our task is analogous to anomaly
detection. We therefore use a statistical analysis of the events to identify worm behavior.
Statistical techniques, e.g., [MHL94,AFV95,PN97,QVWW98], have been widely
applied in anomaly detection. In our prior work, we applied time series-based statistical
analysis to alert correlation [QL03].

Our preference was for a technique that can effectively correlate variables collected 
in a short observation window with a short computation time. Time series-based anal- 
ysis is good for a relatively long observation and requires a series of statistical tests in 
order to accurately correlate variables. It is also often not suitable for real-time analy- 
sis because of its computationally intensive nature. Therefore, in this work, we instead 
apply logistic analysis [HL00] to analyze port correlation.

Logistic regression is a non-linear transformation of the traditional linear regression
model. Instead of correlating two continuous variables, logistic regression considers
(in the simplest case) a dichotomous variable and continuous variables. That is, the
dependent variable is a boolean “dummy” variable coded as 0 or 1, which corresponds
to a state or category we wish to explain. In our case, we treat the honeypot event as
a dichotomous variable, i.e., the honeypot is either awake (1) or quiescent (0). Logit
analysis then seeks to identify which continuous variables explain the changes in the
honeypot state, from asleep to awake.

We settled on using a logit analysis only after considering other, more restrictive 
analysis techniques. A simple linear regression, for example, would compare continu- 
ous-to-continuous variables. In the case of honeypots, this would require either mea- 
suring rates of outgoing packets, or identifying some other continuous measurement in 
the memory, network and disk events. Since it only takes one packet to be infected or 
cause an infection to spread, a simple linear regression approach would not clearly iden- 
tify “sleeper worms” (a false negative scenario) and worms on busy networks (a false 
positive potential). Additionally, measuring outgoing packet rates would also include 
a significant amount of noise, since honeypots routinely complete TCP handshakes for 
the services they offer (e.g., normal, non-harmful webservice, mail service, ftp con- 
nections without successful login, etc.). Using continuous variables based on outgoing 
rates may only be slightly better than using incoming scan rates. 

The basic form of the model expresses a binary expectation of the honeypot state,
E(Y) (asleep or awake) for k events, as seen in Eq. (1).

$$E(Y) = \frac{1}{1 + e^{-Z}}, \quad \text{where } Z = \beta_0 + \epsilon + \sum_{j=1}^{k} \sum_{i=1}^{n} \beta_{ij} X_{ij} \qquad (1)$$

In Eq. (1), j is a counter for each individual honeypot event, and i is a counter for
each individual port traffic observation (ports 1 through n) for a specific honeypot. Each
$\beta_{ij}$ is the regression coefficient corresponding to the $X_{ij}$ variable, a continuous variable
representing each individual port observation. We have one error term $\epsilon$ and one constant
$\beta_0$ for the equation. To set values of $X_{ij}$, we use the inverse of time between an event and
the port activity. Thus, if a MemoryEvent (or honeypot event j) occurs at time t, and just prior
to this, port i on that same honeypot experienced traffic at time $t - \delta t$, the variable $X_{ij}$
would represent the port in the equation, and would have the value of $1/\delta t$. This biases
towards network traffic closer in time to the event, consistent with our infection model
discussed in Section 3.

An example shows how honeypot events are aggregated. Suppose one honeypot
event is observed, with activity to ports $\{P_1, P_2, \ldots, P_n\}$. We calculate the inverse
time difference between the port activity and the honeypot event, and store the values
for $X_{1,1}, X_{2,1}, \ldots, X_{n,1}$ in a table that solves for Y. Suppose then a second event
is recorded, in the same class as the first. We add the second event’s values of
$X_{1,2}, X_{2,2}, \ldots, X_{n,2}$ to the equation. This process continues. After each new event is added,
we re-solve for Y, and calculate new values of $\beta$. After sufficient observations, the logit
analysis can identify candidate ports that explain why the honeypots are becoming active.

The inverse time relation between event and prior traffic allows one to record arbitrary
periods of traffic. Traffic that occurred too long ago will, in practice, have such
a low value for $X_{i,j}$ that it cannot affect the outcome. As a convenience, we cut off
prior traffic $t_p$ at 5 minutes, but even this arbitrary limit is generous. Future work will
explore use of other time treatments as a means of further biasing
toward more recent network events. Note that this assumption prevents HoneyStat from
tracking worms that sleep for a lengthy period of time before spreading. These worms
are presumably self-crippling, and have a slow enough spread rate to allow for human
intervention.
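
As a rough illustration of how one row of the table might be built, the sketch below computes the inverse-time value for each port and applies the five-minute cutoff. The function name, the input layout, the use of the most recent traffic per port, and the value 0.0 for ports with no traffic inside the window are assumptions of this example, not details given in the text.

```python
def inverse_time_features(event_time, port_activity, ports, cutoff=300.0):
    """Build one row of X values for a single honeypot event.

    port_activity maps a port number to timestamps (in seconds) of traffic
    seen on that port. Each X value is 1/dt for the most recent traffic
    before the event, or 0.0 when the port saw nothing within `cutoff`
    seconds (the bound t_p).
    """
    row = []
    for port in ports:
        deltas = [event_time - t for t in port_activity.get(port, [])
                  if 0 < event_time - t <= cutoff]
        row.append(1.0 / min(deltas) if deltas else 0.0)
    return row


ports = [80, 135, 445]
# Two honeypot events, each contributing one row to the table.
table = [
    inverse_time_features(1000.0, {135: [998.0, 900.0], 80: [500.0], 445: [990.0]}, ports),
    inverse_time_features(2000.0, {135: [1996.0], 80: [1990.0]}, ports),
]
```

Traffic two seconds before the event yields 0.5, traffic ten seconds before yields 0.1, and traffic older than the cutoff contributes nothing, which reflects the bias toward recent activity.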

A key quantity in this analysis is the Wald statistic, which lets us test whether
a variable’s coefficient is zero. The Wald statistic is the squared ratio of the coefficient
to its standard error, evaluated with a single degree of freedom [HL00]. The Wald statistic
can be used to reject certain variables, and exclude them from a model. For example, if ports
$P_0, P_1, \ldots, P_n$ were observed prior to a honeypot event, we might exclude some of these
ports based on the ratio of their coefficient $\beta_{ij}$ to their standard error. Thus, the Wald
statistic essentially poses a null hypothesis for each variable, and lets us exclude variables
with zero coefficients. (After all, a variable with a zero $\beta$ value does not contribute
to solving Eq. 1.) This analysis is helpful since it reduces noise in our model. However,
since it uses a simple ratio, when the standard error is large, it can lead one to not reject
certain variables. Thus, the Wald statistic can be used to remove unlikely variables, but
might not always remove variables that have no effect.
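
A small worked example may help here. The Wald values reported in Table 1 match the squared ratio of coefficient to standard error, which is compared against a chi-square distribution with one degree of freedom. The sketch below computes that statistic and its p-value using only the standard library; the function name is an assumption, and the inputs are the port 135 and port 80 rows of Table 1.

```python
import math


def wald_test(beta, std_err):
    """Return (Wald statistic, p-value) for the null hypothesis beta == 0."""
    w = (beta / std_err) ** 2
    # For one degree of freedom, P(chi2 > w) equals erfc(sqrt(w / 2)).
    p = math.erfc(math.sqrt(w / 2.0))
    return w, p


w135, p135 = wald_test(3.114, 0.967)            # port_135 row of Table 1
w80, p80 = wald_test(-17463.185, 2696276.445)   # port_80 row of Table 1
keep_135 = p135 < 0.05   # significant at the 5% level, so the variable stays
keep_80 = p80 < 0.05     # not significant, so the variable is dropped
```

The port 80 row shows the caveat mentioned above: a huge standard error drives the statistic toward zero regardless of the size of the coefficient.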

Applying logistic analysis involves the following steps. First, for a particular honeypot
event j, we estimate the coefficients, i.e., $\beta_{0,j}, \beta_{1,j}, \ldots, \beta_{n,j}$, using maximum
likelihood estimation (MLE) [HL00]. In this step, we try to find a set of coefficients that
minimize the prediction error. Stated another way, MLE assigns values that will maximize
the probability of obtaining the observed set of data. (This is similar to the least
squares method under simple regression analysis.) Second, we use the Wald statistic
to evaluate each variable, and remove those that are not significant at a user-selected
significance level, say, 5%. The intuition of this step is that we try to evaluate whether the
“causal” variable in the model is significantly related to the outcome. In other words, we
essentially ask the question: Is activity on port x significantly related to the honeypot
activity or was it merely random?
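
To make the two steps concrete, here is a minimal sketch of maximum likelihood estimation for a logistic model with a single predictor, followed by the Wald evaluation of its coefficient. It is not the procedure or software used for the paper's results: the Newton-Raphson solver, the restriction to one port variable, the function names, and the toy data are all assumptions chosen to keep the example self-contained.

```python
import math


def _score_and_info(xs, ys, b0, b1):
    """Gradient of the log-likelihood and entries of the information matrix."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return g0, g1, h00, h01, h11


def fit_logit(xs, ys, iters=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson.

    Returns (b0, b1, standard error of b1). Assumes the data are not
    perfectly separable; otherwise the estimates diverge.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0, g1, h00, h01, h11 = _score_and_info(xs, ys, b0, b1)
        det = h00 * h11 - h01 * h01
        # Newton step b += inverse(information) * gradient, 2x2 inverse written out.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    _, _, h00, h01, h11 = _score_and_info(xs, ys, b0, b1)
    se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    return b0, b1, se_b1


# Toy data: x is an inverse-time value for one port, y is 1 when the honeypot woke.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1, se_b1 = fit_logit(xs, ys)
wald = (b1 / se_b1) ** 2                       # step two: Wald statistic for the port
p_value = math.erfc(math.sqrt(wald / 2.0))     # chi-square, one degree of freedom
```

With more ports, the same iteration applies with a larger information matrix; a statistics package would normally be used instead of a hand-written solver.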

If the analysis results in a single variable explaining changes in the honeypot, then 
we report the result as an alert. If the results are not conclusive, the event data is stored 
until additional events are observed, triggering a renewed analysis. Of course, since 
the events involve breakins to honeypots, users may also wish to receive informational 
alerts about these events. 

6 HoneyStat in Practice 

To evaluate HoneyStat’s potential as a local worm detection system, we tested two key 
aspects of the algorithm: (a) does it properly identify worm outbreaks, and (b) what 
false positive rate does it produce? Testing showed that HoneyStat could identify worm 
outbreaks, with a low false positive rate. Our testing with available data showed a
false positive rate of zero. This result is encouraging, given the enormous data set used.
Nonetheless, a zero false positive rate may be due to properties of the data set and we 
will continue to run more experiments. 

6.1 Worm Detection 

In [QDG+], we used data from six honeypots that became active during the Blaster
worm outbreak in August 2003. The trace data used for the analysis also included network
traffic from some 100 /24 darknet IPs. Figure 3 shows an aggregate view of traffic
to all the honeypots on August 9 and 11, as well as background traffic to the darknets.




[Figure 3 plots. Panel (a): HoneyStat Worm Detection, ports 135, 139, 445 (series: scans to
honeypot ports 135, 139, and 445; scans to darknet port 445). Panel (b): Non-Worm Events,
ports 135 (left), and 80, 8080 (right) (series: scans to honeypot ports 135, 80, and 8080;
scans from honeypot, port 135).]



Fig. 3. a) HoneyStat worm detection for Blaster. The Blaster attack on August 11, 2003, is detected
by the honeypots. Upward arrows, not drawn to scale, indicate the presence of outgoing
traffic from the HoneyStat nodes. Traffic prior to the honeypot activity is analyzed, using the
inverse of time difference, so that more recent activities more likely explain the change in the
honeypot. A logit analysis shows that prior scans to port 135 explain these episodes, effectively
identifying the Blaster worm. b) Avoiding false positives. Here, we see a trojaned honeypot node
becoming active days prior to the Blaster worm outbreak. However, since this event is seen only
in isolation (one honeypot), it does not trigger a worm alert. Traffic to ports 80 and 8080 does not
bias the later analysis.



If we mark the honeypot activity as NetEvents, we can examine the previous
network activity to find whether a worm is present. As shown in Table 1, a logit analysis
of the honeypot data shows that of all the variables, port 135 explains the tendency of
honeypots to become active. (In our particular example, one can even visually confirm
in Figure 3(a) that honeypot activity took place right after port 135 traffic arrived.) The
standard error reports the error for the estimated $\beta$, and the significance column reports
the probability that the variable’s apparent influence was merely chance. The Wald statistic
indicates whether the $\beta$ statistic significantly differs from zero. The significance column is the
most critical for our analysis, since it indicates whether the variable’s estimated $\beta$ is
zero. The lower the score, the less chance the variable had no influence on the value of
Y. Thus, we eliminate any variable with a significance above a threshold (5%). From
this, the observations for ports 80, 8080, and 3128 can be discounted as not a significant
explanation for changes in Y.

In this case, the logit analysis performs two useful tasks. First, we use the signif- 
icance column to rule out variables above a certain threshold, leaving only ports 135, 
139 and 445. Second, the analysis lets us rank the remaining variables by significance. 

The logit analysis did not pick one individual port as explaining the value of Y. The
alert that issues therefore identifies three possible causes of the honeypot activity. As
it turns out, this was a very accurate diagnosis of the Blaster outbreak. Recall that just
prior to Blaster’s outbreak on port 135, there were numerous scans being directed at
ports 139 and 445. The port 135 exploit eventually became more popular, since only a
few machines were vulnerable on 445 and 139. We are aware of no statistical test that
could focus on port 135 alone, given the high rate of probing being conducted on ports
139 and 445. This required human insight and domain knowledge to sort out.

Table 1. Logit Analysis of Multiple HoneyStat Events.

Variable     β             Standard Error   Wald      Significance
port_80      -17463.185    2696276.445      .000      .995
port_135     3.114         .967             10.377    .001
port_139     1869.151      303.517          37.925    .000
port_445     -1495.040     281.165          28.274    .000
port_3128    -18727.568    9859594.820      .000      .998
port_8080    10907.922     6919861.448      .000      .999
constant     .068          1.568            .210      1.089

One complicating factor occurs when two zero-day worms attack at the same time, 
using different ports. For example, consider in the example what would happen if traf- 
fic to port 80 were a worm after all. In such a case, more data observations will be 
required to separate out which events support a leading theory of causation. The logit 
analysis will eventually select a pattern for one of the worms, and with the removal of 
those observations, the second worm can be identified. (Recall that once identified, the 
“explained” data is removed from the analysis queue.) Future work will explore the use 
of Best Subsets logistic regression models [HL00], to avoid the linear identification of
multiple worms. 

The number of observations required for logistic regression appears to be a matter
of some recent investigation. In [HL00], the authors (eminent in the field) note “there
has been surprisingly little work on sample size for logistic regression”. Some rough
estimates have been supplied. They note that at least one study shows that a minimum of 10
events per parameter is needed to avoid over- or underestimation of variables [HL00].
Since each honeypot activity observation is paired with a corresponding inactivity observation,
HoneyStat would need to generate five HoneyStat events to meet this requirement.
Section 7 notes how waiting for this many observations potentially affects worm
detection time.

Since each event involves an actual compromise of a system, one could also report 
alerts with a lower confidence level. While we might want more samples and certainty, 
we can at the very least rank likely ports in an alert. 

6.2 Benefits of HoneyStat 

HoneyStat provides the following benefits to local networks: (a) It provides a very accurate
data stream for analysis. Every event is the result of a successful attack. This
significantly reduces the amount of data that must be processed, compared to Kalman
filter and other traditional scan-based algorithms. (b) Since HoneyStat uses complete
operating systems, it detects zero-day worms, for which there is no known signature.
(c) HoneyStat is agnostic about the incoming and outgoing ports for attack packets, as
well as their origin. In this way, it can detect worms that enter on port Pa, and exit on
port Pb.

Thus, HoneyStat reports an explanation of worm activation, and not merely the 
presence of a worm. Other information, such as rate of scans, can be obtained from 
the traffic logs captured for the logit analysis. [Kre03] has already suggested a simple 
method of quickly extracting a binary signature, in a manner compatible with Honey- 
Stat. 

6.3 False Positive Analysis 

Analyzing the false positive rate for HoneyStat is subtle. Since honeypot events always 
involve breakins and successful exploits, it might seem that honeypot-based alert sys- 
tems would produce no false positives. This is not the case. Although the underlying 
data stream consists of serious alerts (successful attacks on honeypots), we still need 
to analyze the potential for the logit analysis to generate a false positive. Two types of 
errors could occur. First, normal network traffic could be misidentified as the source 
of an attack. That is, a worm could be present, but the analysis may identify other, 
normal traffic as the cause. Second, repeated human breakins could be identified as a 
worm. We do not consider this second failure scenario, since in such a case, the manual 
breakins are robotic in nature, and (for all practical purposes) indistinguishable from, 
and potentially just as dangerous as any worm. 

Model Failure. It is not feasible to test HoneyStat on the Internet. This would require 
waiting for the outbreak of worms, and dedicating a large IP space to a test project. We 
can instead perform a retrospective analysis of a tracefile to estimate the chance of a
false positive. 

Using a honeypot activity log, dating from July 2002 to March 2004, we used uniform
random sampling to collect background traffic samples, and injected a worm attack.
The intuition is this: we wish to see whether a HoneyStat logit analysis would produce
a false positive. This could occur if normal non-malicious background traffic occurs
in such a pattern that random sampling produces a candidate solution to the logistic
regression.
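
The background sampling step can be sketched as follows. This covers only the uniform sampling of synthetic event times and the collection of traffic preceding each one; the injection of the worm attack and the regression itself are left out. The function name, the log layout as `(timestamp, port)` tuples, the 300-second window, and the fixed seed are assumptions of this illustration.

```python
import random


def sample_synthetic_events(traffic_log, n_events, window=300.0, seed=0):
    """Draw event times uniformly over the log and attach the prior traffic.

    traffic_log is a list of (timestamp_seconds, port) tuples sorted by time.
    Returns a list of (event_time, prior_traffic) pairs, where prior_traffic
    holds the log entries that fall within `window` seconds before the event.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    start, end = traffic_log[0][0], traffic_log[-1][0]
    events = []
    for _ in range(n_events):
        t = rng.uniform(start, end)
        prior = [(ts, port) for ts, port in traffic_log if t - window <= ts < t]
        events.append((t, prior))
    return events


# Toy log: one packet per minute, alternating between ports 135 and 80.
log = [(float(i * 60), 80 if i % 2 else 135) for i in range(100)]
synthetic = sample_synthetic_events(log, 10)
```

Each synthetic event would then be turned into a row of inverse-time values and fed to the logit analysis to see whether background traffic alone ever yields a significant port.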

The data we use for the background sampling came from the Georgia Tech Hon- 
eynet project. We have almost two years of network data captured from the Honeynet. 
The first year of data was captured on a Generation I Honeynet, which is distinguishable 
by the use of a reverse firewall serving as the gateway for all the Honeypots. The second 
year of data was captured from a Generation II Honeynet, which is distinguishable by 
the use of a packet filtering bridge between all of the Honeypots and their gateway. The 
data is available to other researchers in a sanitized form. 

A random sampling of over 250 synthetic honeypot events did not produce a false 
positive. This certainly does not prove that HoneyStat is incapable of producing a false 
positive. Rather, this may reflect the limited range of the data. A much larger data set 
is required to fully explore the potential of logistic regression to misidentify variables. 
Even if false positives are found, it should be noted that these are not the usual false 
positives, or type I errors found in IDS. Instead, a false positive with a HoneyStat node 
is half right: there are breakins to honeypots, even if the algorithm were to misidentify 
the cause. 

7 HoneyStat as an IDS Tool

The previous sections have shown that HoneyStat can detect worm attacks with a low 
false positive rate. This shows that it could be incorporated into a local intrusion de- 
tection system. A more important question is whether this strategy can detect worm 
outbreaks early. In this section, we present an analytical model. 

A HoneyStat deployment can effectively detect worms that use random scan techniques.
The work in [WPSC03] presents a complete taxonomy of worms. In [QDG+],
we discussed how a detection algorithm with a suitably low false positive rate can protect
local networks. Accordingly, we evaluate HoneyStat against worms that use only random
scanning strategies [ZGGT03]. Since we are interested in local network protections,
the results should apply to other types of scanning worms [QDG+]. Realistically, we
assume the vulnerable hosts are uniformly distributed in the real assigned IPv4 space
(all potential victims are located in this space, denoted as $T = 10^9$), not the whole IPv4
space (denoted as $\Omega = 2^{32}$). Assume N is the total number of vulnerable machines on
the Internet, $n_i$ is the number of whole Internet victims at time tick i, and s is the scan
rate of the worm (per time tick). So the scans entering space T at time tick i + 1 should be
$k_{i+1} = s n_i T / \Omega$. Within this space, the chance of one host being hit is
$1 - (1 - 1/T)^{k_{i+1}}$.

Then we have the worm propagation equation, Eq. (2).

$$n_{i+1} = n_i + [N - n_i]\left[1 - \left(1 - \frac{1}{T}\right)^{s n_i T / \Omega}\right] \qquad (2)$$

In fact, because T and $\Omega$ are very big,
$1 - (1 - 1/T)^{s n_i T / \Omega} \approx 1 - (1 - 1/\Omega)^{s n_i}$. So the
spread rate is almost the same as seen in previous models (e.g., the Analytical Active Worm
Propagation (AAWP) model [CGK03], epidemic models [KW93,KCW93,SPN02], etc.).
Now suppose we have a honeypot network with size D ($D \subset T$). The initial number
of vulnerable hosts is $u_0$. Generally a network with size D has DN/T vulnerable
hosts on average. But with HoneyStat, each network has its own mix of vulnerable OS
distributions. Since most worms target Windows, we can intentionally let most of our
honeypots run Windows so that we present a higher number of initially vulnerable hosts
to the worm. Without loss of generality we suppose $u_0 = D\alpha$. We let $\alpha$ be the minimum
ratio for vulnerable hosts. The number of local victims at time tick i is $v_i$, and
$v_0 = 0$, which means initially there is no victim in our honeypot network. The time for
the first HoneyStat node to become active is $t_1$ (clearly $t_1$ is the first time tick i when
$v_i \geq 1$). We have

$$v_{i+1} = v_i + [u_0 - v_i]\left[1 - \left(1 - \frac{1}{T}\right)^{s n_i T / \Omega}\right], \quad \text{when } i + 1 < t_1 + t_r \qquad (3)$$

Here $u_0 = D\alpha$. We let $t_r$ represent the time required to reconfigure most of the
non-vulnerable honeypots to run the same OS and patch level as the first victim. (In other
words, this is the time required, say, to convert most of the Linux honeypots to Windows,
if Windows is first attacked.) Since we need more observations for the logit analysis
to work, as noted in Section 5, we shift some of our honeypots to match the vulnerable
OS. As noted above, restarting the suspended OS happens quickly, before any
traffic times out. We nonetheless express this delay as response time $t_r$. After the delay,
the number of new vulnerable hosts becomes $u_1 = D\gamma$. We let $\gamma$ represent the maximum
ratio for vulnerable hosts. Normally $\gamma < 1$ because we may not convert all of our
HoneyStat nodes to the operating system that is attacked. We might keep a few to detect
other worms. Now we have

$$v_{i+1} = v_i + [u_1 - v_i]\left[1 - \left(1 - \frac{1}{T}\right)^{s n_i T / \Omega}\right], \quad \text{when } i + 1 \geq t_1 + t_r \qquad (4)$$

Here $u_1 = D\gamma$. We can calculate the time (in terms of the whole Internet infection
percentage) when our first HoneyStat node is infected. Table 2 and Figure 5 show the
effect of different $\alpha$ and D. For example, we can see that using $D = 2^{10}$ and $\alpha = 10\%$,
the first victim is found when only 0.9786% of Internet vulnerable hosts are infected.
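
The recurrences in Eqs. (2) through (4) are easy to iterate numerically. The sketch below does so under explicit assumptions: $T = 10^9$, $\Omega = 2^{32}$, N = 500,000, a scan rate of 20 per tick and a hitlist of one host (the values quoted in the figure captions), honeypot victims tracked as an expected count, and a function name and parameter layout invented for this example. It is meant as a way to explore the model, not as the code behind Table 2.

```python
import math


def infection_percent_at(victims, D, alpha, gamma=None, t_r=0,
                         N=500_000, s=20, T=1e9, omega=2.0 ** 32,
                         hitlist=1, max_ticks=1_000_000):
    """Percent of vulnerable Internet hosts infected when the expected number
    of honeypot victims first reaches `victims`; None if never reached."""
    n = float(hitlist)   # n_i: Internet-wide victims
    v = 0.0              # v_i: expected honeypot victims
    t1 = None            # tick at which the first honeypot victim appears
    for i in range(max_ticks):
        if v >= victims:
            return 100.0 * n / N
        if t1 is None and v >= 1.0:
            t1 = i
        # Eq. (4) applies once the redeployment delay has passed, else Eq. (3).
        redeployed = t1 is not None and gamma is not None and i + 1 >= t1 + t_r
        u = D * gamma if redeployed else D * alpha
        # Per-host hit chance 1 - (1 - 1/T)^(s*n*T/omega), computed stably.
        p = -math.expm1(s * n * T / omega * math.log1p(-1.0 / T))
        v += (u - v) * p   # Eqs. (3) and (4)
        n += (N - n) * p   # Eq. (2)
    return None


first = infection_percent_at(1, D=2 ** 10, alpha=0.10)
five = infection_percent_at(5, D=2 ** 11, alpha=0.25, gamma=0.75, t_r=10)
```

Under these assumptions the first-victim case lands close to the corresponding Table 2 entry, and the sketch makes it easy to see that D dominates the outcome while $t_r$ and $\gamma$ have a much smaller effect.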



Table 2. Time (infection percentage) when HoneyStat network has a first victim.

α      $D=2^{8}$   $D=2^{9}$   $D=2^{10}$   $D=2^{11}$   $D=2^{12}$   $D=2^{13}$   $D=2^{14}$   $D=2^{15}$   $D=2^{16}$
10%    3.9141%     1.9558%     0.9786%      0.4895%      0.2448%      0.1223%      0.0613%      0.0307%      0.0155%
25%    1.5634%     0.7825%     0.3910%      0.1959%      0.0981%      0.0491%      0.0247%      0.0124%      0.0063%
50%    0.7825%     0.3910%     0.1959%      0.0981%      0.0491%      0.0247%      0.0124%      0.0063%      0.0033%
75%    0.5210%     0.2606%     0.1305%      0.0655%      0.0328%      0.0165%      0.0083%      0.0043%      0.0022%
100%   0.3910%     0.1959%     0.0981%      0.0491%      0.0247%      0.0124%      0.0063%      0.0033%      0.0017%




Fig. 4. Effect of HoneyStat network size, D, maximum percentage of vulnerable hosts, $\gamma$, and
time to redeploy after first victim, $t_r$, on the victim count. These graphs, drawn with $\alpha = 0.25$;
N=500,000; scanrate=20 per time tick; Hitlist=1, show that with a larger IP space monitored by
HoneyStat, D, the detection time (as a percent of the infected Internet) improves greatly. Even
with only $2^{10}$ IPs monitored, detection time is quick, requiring only a little more than 1% of the
Internet to be infected.



When the first HoneyStat node becomes infected, the other nodes switch to the
vulnerable OS. This takes time $t_r$, after which there will be $u_1 = D\gamma$ vulnerable hosts.
After redeployment, the chance of getting the next victim improves. This is shown in
Eq. (4). The effect of D, $\gamma$ and $t_r$ is shown in Figure 4. Here, we can see that after
redeployment we will quickly get enough victims when the whole Internet has a low
infection percentage. This occurs because we obtain more vulnerable honeypots after
the HoneyStat array is switched to match the OS of the first victim. Therefore, we get
higher chances of being hit by the worm. For example, if $\alpha = 0.25$, $D = 2^{16}$, $\gamma = 0.7$,
$t_r = 10$, four victims appear in the HoneyStat network very early, when only
0.013% of Internet vulnerable hosts are infected. To have 10 victims, still only 0.0261% of
Internet vulnerable hosts are infected. And we can see that $t_r = 10$ or $t_r = 100$, $\gamma = 0.7$
or $\gamma = 0.9$ do not affect the outcome very much. Instead, the size of honeynet D is the
most important factor. Thus, the delay in switching HoneyStat nodes does not play a
critical role in overall worm detection time.




Fig. 5. Effect of $\alpha$ and D on Time (Infection Percentage) when HoneyStat network has a first
victim. N=500,000, Scan rate=20 per time tick.



In Section 4, we noted that machines can be massively multihomed, so that one
piece of hardware can effectively handle hundreds of IP addresses in multiple virtual
machines. From the discussion of D above, $2^{11}$ is already a reasonable number of IP
addresses that can be used in our local early worm detection. Assuming we had a few
computers sufficient to allow $D = 2^{11}$ and $\alpha = 0.25$, we can see from Table 2 that the
first victim appears when on average 0.1959% of Internet vulnerable hosts are infected.
Suppose $\gamma = 0.75$, $t_r = 10$, then to have 5 victims in our honeynet (or enough to have
a minimal number of data points suggested in Section 6), it is still very early when only
0.4794% of the Internet’s vulnerable hosts are infected. When one IP is infected, we
reset the OS so that it can be infected again. This kind of “replacement” policy makes
the whole honeynet work as we have discussed above although there are only, say, 64
virtual machines running on every GSX server.



8 Conclusion 

Local detection systems deserve further exploration. We have suggested that in addition 
to increasing the quantity of data used by alert systems, the quality can be improved as 
well. It has been said that if intrusion detection is like finding a needle in a haystack,
then a honeypot is like a stack of needles. Literally every event from a honeypot is
noteworthy. Honeypots are therefore used to create a highly accurate alert stream.

Using logistic regression, we have shown how a honeypot alert stream can detect
worm outbreaks. We define three classes of events to capture memory, disk and network
activities of worms. The logit analysis can eliminate noise sampled during these events, 
and identify a likely list of causes. Using extensive data traces of previous worm events, 
we have demonstrated that HoneyStat can identify worm activity. An analytical model 
suggests that, with enough multihomed honeypots, this provides an effective way to 
detect worms early. 

While participation in global monitoring efforts has value, we believe local network 
strategies require exploration as well. Further work could include identification of
additional logistic models to sort through large sets of data, coordination of shared honeypot
events, integration with other intrusion detection techniques, and response. 
Abstract. Worm detection and response systems must act quickly to
identify and quarantine scanning worms, as when left unchecked such
worms have been able to infect the majority of vulnerable hosts on the
Internet in a matter of minutes [9]. We present a hybrid approach to
detecting scanning worms that integrates significant improvements we have
made to two existing techniques: sequential hypothesis testing and
connection rate limiting. Our results show that this two-pronged approach
successfully restricts the number of scans that a worm can complete, is
highly effective, and has a low false alarm rate.



1 Introduction 

Human reaction times are inadequate for detecting and responding to fast scan- 
ning worms, such as Slammer, which can infect the majority of vulnerable sys- 
tems in a matter of minutes [18, 9]. Thus, today’s worm response proposals focus 
on automated responses to worms, such as quarantining infected machines [10], 
automatic generation and installation of patches [14, 15], and reducing the rate at 
which worms can issue connection requests so that a more carefully constructed 
response can be crafted [22,27]. 

Even an automated response will be of little use if it fails to be triggered 
quickly after a host is infected. Infected hosts with high-bandwidth network con- 
nections can initiate thousands of connection requests per second, each of which 
has the potential to spread the infection. On the other hand, an automated 
response that triggers too easily will erroneously identify hosts as infected, in- 
terfering with these hosts’ reliable performance and causing significant damage. 

Many scan detection mechanisms rely upon the observation that only a small 
fraction of addresses are likely to respond to a connection request at any given 
port. Many IPv4 addresses are dead ends as they are not assigned to active hosts. 
Others are assigned to hosts behind firewalls that block the port addressed by the 
scanner. When connection requests do reach active hosts, many will be rejected 
as not all hosts will be running the targeted service. Thus, scanners are likely to 
have a low rate of successful connections, whereas benign hosts, which only issue 
connection requests when there is reason to believe that addressees will respond, 
will have a much greater rate of success. 

Existing methods for detecting scanning worms within a local network use
fixed thresholds for the number of allowable failed connections over a time
period [16] or limit the rate at which a host can initiate contact with additional
hosts [27]. However, these threshold-based approaches may fail to detect low-rate
scanning. They may also require an excessive number of connection observations
to detect an infection or lead to an unnecessary number of false alarms.

To detect inbound scans initiated by hosts outside the local network, pre- 
vious work on which we collaborated [7] used an approach based on sequential 
hypothesis testing. This approach automatically adjusts the number of observa- 
tions required to detect a scan with the strength of the evidence supporting the 
hypothesis that the observed host is, in fact, scanning. The advantage of this 
approach is that it can reduce the number of connection requests that must be 
observed to detect that a remote host is scanning while maintaining an accept- 
able false alarm rate. 

While this approach shows promise for quickly detecting scanning by hosts 
inside a local network soon after they have been infected by a worm, there 
are significant hurdles to overcome. For one, to determine whether a request to 
connect to a remote host will fail, one must often wait to see whether a successful 
connection response will be returned. Until enough connection requests can be 
established to be failures, a sequential hypothesis test will lack the observations 
required to conclude that the system is infected. By the time the decision to 
quarantine the host is made, a worm with a high scan rate may have already 
targeted thousands of other hosts. 

This earlier work used a single sequential hypothesis test per host and did 
not re-evaluate benign hosts over time. Unlike an intrusion detection system ob- 
serving remote hosts, a worm detection system is likely to observe benign traffic 
originating from an infected host before it is infected. It is therefore necessary 
to adapt this method to continuously monitor hosts for indications of scanning. 

We introduce an innovative approach that enables a Worm Detection System 
(WDS) to continuously monitor a set of local hosts for infection, requiring a small 
number of observations to be collected after an infection to detect that the host 
is scanning (Figure 1). 

To detect infected hosts, the WDS need only process a small fraction of
network events: a subset of connection request observations that we call
first-contact connection requests and the responses to these requests that complete
the connections. A first-contact connection request is a packet (TCP or UDP)
addressed to a host with which the sender has not previously communicated.
These events are monitored because scans are mostly composed of first-contact
connection requests.

Fig. 1. A Worm Detection System (WDS) is located to monitor a local network

In Section 2, we introduce a scan detection algorithm that we call a reverse
sequential hypothesis test ($\overleftarrow{HT}$), and show how it can reduce the
number of first-contact connections that must be observed to detect scanning.^1
Unlike previous methods, the number of observations $\overleftarrow{HT}$ requires
to detect hosts’ scanning behavior is not affected by the presence of benign
network activity that may be observed before scanning begins.

In Section 3, we introduce a new credit-based algorithm for limiting the 
rate at which a host may issue the first-contact connections that are indica- 
tive of scanning activity. This credit-based connection rate limiting (CBCRL) 
algorithm results in significantly fewer false positives (unnecessary rate limiting) 
than existing approaches. 

When combined, this two-pronged approach is effective because these two 
algorithms are complementary. Without credit-based connection rate limiting, a 
worm could rapidly issue thousands of connection requests before enough con- 
nection failures have been observed by Reverse Sequential Hypothesis Testing so 
that it can report the worm’s presence. Because Reverse Sequential Hypothesis 
Testing processes connection success and failure events in the order that con- 
nection requests are issued, false alarms are less likely to occur than if we used 
an approach purely based on credit-based connection rate limiting, for which 
first-contact connection attempts are assumed to fail until the evidence proves
otherwise. 

We demonstrate the utility of these combined algorithms with trace-driven 
simulations, described in Section 4, with results presented in Section 5. The 
limitations of our approach, including strategies that worms could attempt to 
avoid detection, are presented in Section 6. We discuss related work, including 
previous approaches to the scanning worm detection problem, in Section 7. Our 
plans for future work are presented in Section 8, and we conclude in Section 9. 

2 Detecting Scanning Worms by Using Reverse Sequential Hypothesis Testing

A worm is a form of malware that spreads from host to host without human 
intervention. A scanning worm locates vulnerable hosts by generating a list of 
addresses to probe and then contacting them. This address list may be gener- 
ated sequentially or pseudo-randomly. Local addresses are often preferentially 
selected [25] as communication between neighboring hosts will likely encounter 
fewer defenses. Scans may take the form of TCP connection requests (SYN packets)
or UDP packets. In the case of the connectionless UDP protocol, it is possible
for the scanning packet to also contain the body of the worm, as was the case
with the Slammer worm [9].

^1 The letters in this abbreviation, $\overleftarrow{HT}$, stand for Hypothesis Testing
and the arrow indicates the reverse sequential order in which observations are processed.

In this section, we present an on-line algorithm for detecting the presence of 
scanners within a local network by observing network traffic. We use a sequential 
hypothesis test for its ability to adjust the number of observations required to 
make a decision to match the strength of the evidence it is presented with. 



2.1 Sequential Hypothesis Testing 

As with existing approaches to scan detection [7,17,22,27], we rely upon the 
observation that only a small fraction of addresses are likely to respond to a 
connection request at any given port. Benign hosts, which only contact systems 
when they have reason to believe that this connection request will be accepted, 
are more likely to receive a response to a connection request. 

Recall that a first-contact connection request is a packet (TCP or UDP) 
addressed to a host with which the sender has not previously communicated. 
When a local host $l$ initiates a first-contact connection request to a destination
address, $d$, we classify the outcome as either a “success” or a “failure”. If the
request was a TCP SYN packet, the connection is said to succeed if a SYN-ACK
is received from $d$ before a timeout expires. If the request is a UDP packet,
any UDP packet from $d$ received before the timeout will do. We let $Y_i$ be a
random (indicator) variable that represents the outcome of the $i$-th first-contact
connection request by $l$, where

$$Y_i = \begin{cases} 0 & \text{if the connection succeeds} \\ 1 & \text{if the connection fails} \end{cases}$$

Detecting scanning by local hosts is a problem that is well suited for the 
method of sequential hypothesis testing first developed by Wald [24], and used 
in our earlier work to detect remote scanners [7]. 

We call $H_1$ the hypothesis that host $l$ is engaged in scanning (indicating
infection by a worm) and $H_0$ the null hypothesis that the host is not scanning. We
assume that, conditional on the hypothesis $H_j$, the random variables $Y_i \mid H_j$,
$i = 1, 2, \ldots$ are independent and identically distributed (i.i.d.). That is, conditional
on the hypothesis, any two connection attempts will have the same likelihood
of succeeding, and their chances of success are unrelated to each other. We can
express the distribution of the Bernoulli random variable $Y_i$ as:

$$\Pr[Y_i = 0 \mid H_0] = \theta_0, \qquad \Pr[Y_i = 1 \mid H_0] = 1 - \theta_0$$

$$\Pr[Y_i = 0 \mid H_1] = \theta_1, \qquad \Pr[Y_i = 1 \mid H_1] = 1 - \theta_1$$

Given that connections originating at benign hosts are more likely to succeed
than those initiated by a scanner, $\theta_0 > \theta_1$.

Sequential hypothesis testing chooses between two hypotheses by comparing
the likelihoods that the model would generate the observed sequence of events,
$\mathbf{Y}_n = (Y_1, \ldots, Y_n)$, under each hypothesis. It does this by maintaining the ratio
$\Lambda(\mathbf{Y}_n)$, the numerator of which is the likelihood that the model would generate
the sequence of events $\mathbf{Y}_n$ under hypothesis $H_1$ and the denominator under
hypothesis $H_0$.

$$\Lambda(\mathbf{Y}_n) = \frac{\Pr[\mathbf{Y}_n \mid H_1]}{\Pr[\mathbf{Y}_n \mid H_0]} \tag{1}$$


The i.i.d. assumption in the model enables us to state this ratio in terms of
the likelihoods of the individual events.

$$\Lambda(\mathbf{Y}_n) = \prod_{i=1}^{n} \frac{\Pr[Y_i \mid H_1]}{\Pr[Y_i \mid H_0]} \tag{2}$$


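We can write the change to $\Lambda(\mathbf{Y}_n)$ as a result of the $i$-th observation as $\phi(Y_i)$:

$$\phi(Y_i) = \frac{\Pr[Y_i \mid H_1]}{\Pr[Y_i \mid H_0]} = \begin{cases} \dfrac{\theta_1}{\theta_0} & \text{if } Y_i = 0 \text{ (success)} \\[1ex] \dfrac{1 - \theta_1}{1 - \theta_0} & \text{if } Y_i = 1 \text{ (failure)} \end{cases}$$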



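This enables us to rewrite $\Lambda(\mathbf{Y}_n)$ inductively, such that $\Lambda(\mathbf{Y}_0) = 1$, and
$\Lambda(\mathbf{Y}_n)$ may be calculated iteratively as each observation arrives.

$$\Lambda(\mathbf{Y}_n) = \prod_{i=1}^{n} \phi(Y_i) = \Lambda(\mathbf{Y}_{n-1})\,\phi(Y_n)$$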



One compares the likelihood ratio $\Lambda(\mathbf{Y}_n)$ to an upper threshold, $\eta_1$, above
which we accept hypothesis $H_1$, and a lower threshold, $\eta_0$, below which we accept
hypothesis $H_0$. If $\eta_0 < \Lambda(\mathbf{Y}_n) < \eta_1$ then the result will remain inconclusive until
more events in the sequence can be evaluated. This is illustrated in Figure 2.
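
As a rough illustration of the test just described, the following Python sketch updates the likelihood ratio one observation at a time and stops once a threshold is reached. The function names are hypothetical. The parameter values are borrowed from the configuration reported in Section 4, with the thresholds assigned as in Equations (5) and (6). Treating a ratio exactly equal to a threshold as conclusive is an assumption of this sketch, which is not the authors' implementation.

```python
# Sketch of a forward sequential hypothesis test (illustrative names only).
THETA_0 = 0.7                       # Pr[success | benign], from Section 4
THETA_1 = 0.1                       # Pr[success | scanner], from Section 4
ALPHA, BETA = 0.00005, 0.99         # confidence requirements, from Section 4
ETA_1 = BETA / ALPHA                # upper threshold, Equation (5)
ETA_0 = (1 - BETA) / (1 - ALPHA)    # lower threshold, Equation (6)


def phi(y):
    """Likelihood-ratio factor for one observation (0 = success, 1 = failure)."""
    if y == 0:
        return THETA_1 / THETA_0
    return (1 - THETA_1) / (1 - THETA_0)


def sequential_test(observations):
    """Return (verdict, number of observations consumed)."""
    ratio = 1.0
    for n, y in enumerate(observations, start=1):
        ratio *= phi(y)
        if ratio >= ETA_1:          # assumption: reaching the threshold is enough
            return "infected", n
        if ratio <= ETA_0:
            return "benign", n
    return "pending", len(observations)
```

With these values each failure multiplies the ratio by 3 and each success divides it by 7, so the sketch needs ten consecutive failures to report an infection and three consecutive successes to report a benign host.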






Fig. 2. A log scale graph of $\Lambda(\mathbf{Y})$ as each observation, $Y_i$, is added to the sequence.
Each success (0) observation decreases $\Lambda(\mathbf{Y})$, moving it closer to the benign conclusion
threshold $\eta_0$, whereas each failure (1) observation increases $\Lambda(\mathbf{Y})$, moving it closer to
the infection conclusion threshold $\eta_1$

Writing the probability of correctly reporting detection (declaring host is
infected when indeed it is) as $P_D$ and the probability of a false positive (declaring
host is infected when in fact it is not) as $P_F$, we can define our performance
requirements as bounds $\alpha$ and $\beta$ on these probabilities.

$$\alpha \ge P_F \quad \text{and} \quad \beta \le P_D$$

Because every false positive can decrease productivity of both the users of a host
and the security staff who need to inspect it, one would expect to use $\alpha$ values
that are small fractions of a percentage point. Since scanners generate enough
traffic to clearly differentiate their behavior from that of benign systems, a $\beta$ of
greater than 0.99 should be an achievable requirement.

Wald [24] showed that $\eta_1$ and $\eta_0$ can be bounded in terms of $P_D$ and $P_F$.

$$\eta_1 \le \frac{P_D}{P_F} \tag{3}$$

$$\frac{1 - P_D}{1 - P_F} \le \eta_0 \tag{4}$$



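Given our requirement parameters $\alpha$ and $\beta$, we assign the following values
to our thresholds, $\eta_0$ and $\eta_1$:

$$\eta_1 \leftarrow \frac{\beta}{\alpha} \tag{5}$$

$$\eta_0 \leftarrow \frac{1 - \beta}{1 - \alpha} \tag{6}$$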



From Equations (3) and (5), we can bound $P_F$ in terms of $\alpha$ and $\beta$. Since
$0 < P_D < 1$, we can replace $P_D$ with 1 in Equation (3) to yield:

$$\eta_1 \le \frac{P_D}{P_F} < \frac{1}{P_F} \tag{7}$$

It follows that:

$$P_F < \frac{1}{\eta_1} = \frac{\alpha}{\beta} \tag{8}$$

Likewise, using Equation (4) and given that $1 - P_D < (1 - P_D)/(1 - P_F)$, we
can bound $1 - P_D$:

$$1 - P_D < \eta_0 = \frac{1 - \beta}{1 - \alpha} \tag{9}$$

While $\eta_1$ may result in a false positive rate above our desired bound by a
factor of $\frac{1}{\beta}$, this difference is negligible given our use of $\beta$ values in the range of
0.99 and above. Similarly, while our miss rate, $1 - P_D$, may be off by as much
as a factor of $\frac{1}{1 - \alpha}$, this too will have negligible effect given our requirements for
very small values of $\alpha$.
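
To give a sense of how small these deviations are, the bounds in Equations (8) and (9) can be evaluated for the confidence requirements reported in Section 4 ($\alpha = 0.00005$, $\beta = 0.99$). This is only a worked illustration of the formulas above, not an additional result.

```latex
\begin{align*}
\eta_1 &= \frac{\beta}{\alpha} = \frac{0.99}{0.00005} = 19800,
&
\eta_0 &= \frac{1-\beta}{1-\alpha} = \frac{0.01}{0.99995} \approx 0.0100005 \\[1ex]
P_F &< \frac{\alpha}{\beta} \approx 5.05 \times 10^{-5},
&
1 - P_D &< \frac{1-\beta}{1-\alpha} \approx 0.0100005
\end{align*}
```

Both bounds stay within about one percent of the requested values of $\alpha = 5 \times 10^{-5}$ and $1 - \beta = 0.01$.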



2.2 Detecting Infection Events 

In our earlier work, it was assumed that each remote host was either a scanner or 
benign for the duration of the observed period. When a host was determined to be 
benign it would no longer be observed. In contrast, in this paper we are concerned 
with detecting infection events, in which a local host transitions from a benign 
state to an infected state. Should a host become infected while a hypothesis test 
is already running, the set of outcomes observed by the sequential hypothesis 
test may include those from both the benign and infected states, as shown in
Figure 3. Even if we continue to observe the host and start a new hypothesis
test each time a benign conclusion is reached, the test may take longer than
necessary to conclude that an infection has occurred.

Fig. 3. A log scale graph tracing the value of $\Lambda(\mathbf{Y})$ as it is updated for a series of
observations that includes first-contact connection requests before ($Y_{i-1}$ and $Y_{i-2}$) and
after ($Y_i$ and beyond) the host was infected



The solution to this problem is to run a new sequential hypothesis test as each
connection outcome is observed, evaluating these outcomes in reverse chronological
order, as illustrated in Figure 4. To detect a host that was infected before it
issued first-contact connection $i$ (event $Y_i$), but after it had issued first-contact
connection $i - 1$, a reverse sequential hypothesis test ($\overleftarrow{HT}$) would require the
same number of observations to detect the infection as would a forward sequential
hypothesis test that had started observing the sequence at observation $i$. Because
the most recent observations are processed first, the reverse test will terminate
before reaching the observations that were collected before infection.

Fig. 4. A log scale graph tracing the value of $\Lambda(Y_{i+5}, Y_{i+4}, \ldots)$, in which the
observations in $\mathbf{Y}$ are processed in reverse sequential order. The most recent, or rightmost,
observation is the first one processed
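
The reverse-order evaluation can be sketched directly, if inefficiently, by rerunning a test over the stored history each time a new outcome arrives. The code below is a minimal sketch under assumed names (`reverse_test`, `first_detection`), using the parameter values from Section 4. It stores the full history, which is the naive approach rather than an optimized one.

```python
# Naive reverse sequential hypothesis testing (illustrative sketch).
THETA_0, THETA_1 = 0.7, 0.1             # success probabilities, from Section 4
ETA_1 = 0.99 / 0.00005                  # upper threshold, beta / alpha
ETA_0 = (1 - 0.99) / (1 - 0.00005)      # lower threshold, (1 - beta) / (1 - alpha)


def phi(y):
    """Likelihood-ratio factor for one observation (0 = success, 1 = failure)."""
    return THETA_1 / THETA_0 if y == 0 else (1 - THETA_1) / (1 - THETA_0)


def reverse_test(history):
    """Run one test over `history`, processing the newest observation first."""
    ratio = 1.0
    for y in reversed(history):
        ratio *= phi(y)
        if ratio >= ETA_1:
            return "infected"
        if ratio <= ETA_0:
            return "benign"
    return "pending"


def first_detection(observations):
    """Index of the first observation at which a reverse test reports infection."""
    for n in range(1, len(observations) + 1):
        if reverse_test(observations[:n]) == "infected":
            return n - 1
    return None
```

For a host that produces fifty successes and then starts failing, the sketch reports the infection at the tenth failure (index 59), no matter how long the benign prefix is.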



When we used sequential hypothesis testing in our prior work to detect scanning
of a local network by remote hosts, the intrusion detection system could
know a priori whether a connection would fail given its knowledge of the network
topology and services [7]. Thus, the outcome of a connection request from
host $i$ could immediately be classified as a success or failure observation ($Y_i$) and
$\Lambda(\mathbf{Y}_n)$ could be evaluated without delay.

When a local host initiates first-contact connection requests to remote hosts, 
such as those shown in Figure 5, the worm detection system cannot immediately 
determine if the connection will succeed or fail. While some connection failures
will result in a TCP RST packet or an ICMP packet [1,3], empirical evidence has
shown that most do not [2]. The remaining connection attempts can be classified
as failures only after a timeout expires.

Fig. 5. The success of first-contact connection requests by a local host to remote hosts
cannot be established by the Worm Detection System (WDS) until a response is
observed or a timeout expires

While a sequential hypothesis test waits for unsuccessful connections to time 
out, a worm may send thousands of additional connection requests with which 
to infect other systems. To limit the number of outgoing first-contact connec- 
tions, a sequential hypothesis testing approach can be paired with a credit-based 
connection rate limiter as described in Section 3. 

2.3 Algorithmic Implementation 

A naive implementation of repeated reverse sequential hypothesis testing re- 
quires that we store an arbitrarily large sequence of first-contact connection 
observations. A naive implementation must also step through a portion of this 
sequence each time a new observation is received in order to run a new test 
starting at that observation. 

Fortunately, there exists an iterative function:

$$\bar{\Lambda}(\mathbf{Y}_n) = \max\left(1,\ \bar{\Lambda}(\mathbf{Y}_{n-1})\,\phi(Y_n)\right)$$

with state variable $\bar{\Lambda}(\mathbf{Y}_n)$, that can be calculated in the sequence in which events
are observed, and that has the property that its value will exceed $\eta_1$ if and only
if a reverse sequential hypothesis test would conclude from this sequence that
the host was infected. This is proved in Appendix A.

Updating $\bar{\Lambda}$ for each observation requires only a single multiplication and two
comparison operations.^2 Because $\bar{\Lambda}$ is updated in sequence, observations can be
discarded immediately after they are used to update the value of $\bar{\Lambda}$.

^2 In fact, addition and subtraction operations are adequate as the iterative function is
equivalent to $\bar{\vartheta}(\mathbf{Y}_n) = \max\left(0,\ \bar{\vartheta}(\mathbf{Y}_{n-1}) + \ln \phi(Y_n)\right)$ where $\bar{\vartheta}(\mathbf{Y}_n) = \ln \bar{\Lambda}(\mathbf{Y}_n)$.
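
The constant-memory update can be sketched as follows. The function names are hypothetical, the parameter values come from Section 4, and starting the state at 1 (0 in log space) is an assumption made here by analogy with the forward test; the source does not state the initial value.

```python
import math

THETA_0, THETA_1 = 0.7, 0.1     # success probabilities, from Section 4
ETA_1 = 0.99 / 0.00005          # upper threshold, beta / alpha


def phi(y):
    """Likelihood-ratio factor for one observation (0 = success, 1 = failure)."""
    return THETA_1 / THETA_0 if y == 0 else (1 - THETA_1) / (1 - THETA_0)


def update(state, y):
    """One step of the iterative function: a multiplication and a max."""
    return max(1.0, state * phi(y))


def update_log(log_state, y):
    """Equivalent step in log space, using only addition and a max."""
    return max(0.0, log_state + math.log(phi(y)))


def first_detection(observations):
    """Index of the first observation at which the state reaches ETA_1."""
    state = 1.0                 # assumed initial value
    for n, y in enumerate(observations):
        state = update(state, y)
        if state >= ETA_1:
            return n
    return None
```

Because successes can never push the state below 1, a long benign history leaves no "credit" that later failures would first have to overcome; only the current state needs to be kept per host.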
enum status {PENDING, SUCCESS, FAILURE};
struct FCC_Queue_Entry {
    ip4_addr DestAddr;       /* destination address of the first-contact request */
    time     WhenInitiated;  /* time at which the request was observed */
    status   Status;         /* PENDING until a response or timeout is seen */
};

Fig. 6. The structure of entries in the First-Contact Connection (FCC) queue

When running this algorithm in a worm detection system, we must maintain
separate state information for each host being monitored. Thus, a state variable
$\bar{\Lambda}_l$ is maintained for each local host $l$.

It is also necessary to track which hosts have been previously contacted by $l$.
We track the set of Previously Contacted Hosts, or PCH set, for each local host.

Finally, each local host $l$ has an associated queue of the first-contact
connection attempts that $l$ has issued but that have not yet been processed as
observations. The structure of the records that are pushed on this FCC queue is
shown in Figure 6. The choice of a queue for this data structure ensures that
first-contact connection attempts are processed in the order in which they are
issued, not in the order in which their status is determined.

The algorithm itself is quite simple and is triggered upon one of three events. 

1. When the worm detection system observes a packet (TCP SYN or UDP)
sent by local host $l$, it checks to see if the destination address $d$ is in $l$’s
previously contacted host (PCH) set. If it isn’t, it adds $d$ to the PCH set and
adds a new entry to the end of the FCC queue with $d$ as the destination
address and status PENDING.

2. When an incoming packet arrives addressed to local host $l$ and the source
address is also the destination address (DestAddr) of a record in $l$’s FCC
queue, the packet is interpreted as a response to the first-contact connection
request and the status of the FCC record is updated. The status of the FCC
record is set to SUCCESS unless the packet is a TCP RST packet, which
indicates a rejected connection.

3. Whenever the entry on the front of the FCC queue has status PENDING and
has been in the queue longer than the connection timeout period, a timeout
occurs and the entry is assigned the status of FAILURE.

When any of the above events causes the entry at the front of the FCC queue to
have status other than PENDING, it is dequeued and $\bar{\Lambda}_l$ is updated and compared
to $\eta_1$. If $\bar{\Lambda}_l \ge \eta_1$, we halt testing for host $l$ and immediately conclude that $l$
is infected. Dequeuing continues so long as $\bar{\Lambda}_l < \eta_1$, the front entry of the FCC
queue has status other than PENDING, and the queue is not empty.
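
The three events and the dequeuing rule above can be sketched as a small per-host state machine. Everything named here (`HostState`, `on_outgoing`, `on_incoming`, `on_clock`) is hypothetical; the PCH set is left unbounded for simplicity, the three-second timeout and the probability estimates are taken from Section 4, and only records still marked PENDING are updated by a response, which is an assumption of this sketch.

```python
from collections import deque

PENDING, SUCCESS, FAILURE = "PENDING", "SUCCESS", "FAILURE"
THETA_0, THETA_1 = 0.7, 0.1     # success probabilities, from Section 4
ETA_1 = 0.99 / 0.00005          # upper threshold, beta / alpha
TIMEOUT = 3.0                   # seconds, grace period from Section 4


class HostState:
    """Detection state kept for one local host."""

    def __init__(self):
        self.pch = set()        # previously contacted hosts (unbounded here)
        self.fcc = deque()      # entries: [dest, when_initiated, status]
        self.lam = 1.0          # state variable of the iterative test
        self.infected = False

    def on_outgoing(self, dest, now):
        """Event 1: the host sends a TCP SYN or UDP packet to `dest`."""
        if dest not in self.pch:
            self.pch.add(dest)
            self.fcc.append([dest, now, PENDING])
        self._drain(now)

    def on_incoming(self, src, now, is_rst=False):
        """Event 2: a packet from `src` arrives for this host."""
        for entry in self.fcc:
            if entry[0] == src and entry[2] == PENDING:
                entry[2] = FAILURE if is_rst else SUCCESS
                break
        self._drain(now)

    def on_clock(self, now):
        """Event 3: time passes, so the front entry may have timed out."""
        self._drain(now)

    def _drain(self, now):
        """Dequeue resolved entries in the order the requests were issued."""
        while self.fcc and not self.infected:
            _, when, status = self.fcc[0]
            if status == PENDING and now - when > TIMEOUT:
                status = FAILURE
            if status == PENDING:
                break           # the front entry is unresolved; later ones wait
            self.fcc.popleft()
            if status == SUCCESS:
                factor = THETA_1 / THETA_0
            else:
                factor = (1 - THETA_1) / (1 - THETA_0)
            self.lam = max(1.0, self.lam * factor)
            self.infected = self.lam >= ETA_1
```

A success recorded for a later request stays in the queue until every earlier request has been resolved, which is what keeps the observations in issue order.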

3 Slowing Worm Propagation by Using Credit-Based Connection Rate Limiting

It is necessary to limit the rate at which first-contact connections can be initiated
in order to ensure that worms cannot propagate rapidly between the moment
scanning begins and the time at which the scan’s first-contact connections have
timed out and been observed by our reverse sequential hypothesis test ($\overleftarrow{HT}$).

Twycross and Williamson [27, 22] use a technique they call a virus throttle 
to limit outgoing first-contact connections. When observing a given host, their 
algorithm maintains a working set of up to five hosts previously contacted by the 
host they are observing. For the purpose of their work, a first-contact connection 
is a connection to a host not in this working set. First-contact connections issued 
when the working set is full are not sent out, but instead added to a queue. Once 
per second the least recently used entry in the working set is removed and, if 
the pending queue of first-contact connection requests is not empty, a request 
is pulled off the queue, delivered, and its destination address is added to the 
working set. All requests in the queue with the same destination address are 
also removed from the queue and delivered. 

Virus throttling is likely to interfere with HTTP connection requests for 
inlined images, as many Web pages contain ten or more inlined images each of 
which is located on a distinct peering server. While a slow but bursty stream 
of requests from a Web browser will eventually be released by the throttle, 
mail servers, Web crawlers, and other legitimate services that issue first-contact 
connections at a rate greater than once per second will overflow the queue. In this 
case, the virus throttling algorithm quarantines the host and allows no further 
first-contact connections. 

To achieve rate limiting with a better false positive rate, we once again present
a solution that is inspired by sequential hypothesis testing and that relies on the
observation that benign first-contact connections are likely to succeed whereas those
issued by scanners are likely to fail. This credit-based approach, however, is
unlike $\overleftarrow{HT}$ in that it assumes that a connection will fail until evidence proves
otherwise. Because it does not wait for timeouts to act, it can react immediately
to a burst of connections and halt the flow so that $\overleftarrow{HT}$ can then make a
more informed decision as to whether the host is infected. As it does not force
connections to be evaluated in order, CBCRL can also immediately process
evidence of connection successes. This will enable it to quickly increase the allowed
first-contact connection rate when these requests are benign.

Credit-based connection rate limiting, as summarized in Figure 7, works by
allocating to each local host, $l$, a starting balance of ten credits ($C_l \leftarrow 10$)
which can be used for issuing first-contact connection requests. Whenever a
first-contact connection request is observed, a credit is subtracted from the sending
host’s balance ($C_l \leftarrow C_l - 1$). If the successful acknowledgment of a first-contact
connection is observed, the host that initiated the request is issued two additional
credits ($C_l \leftarrow C_l + 2$). No action is taken when connections fail, as the cost of
issuing a first-contact connection has already been deducted from the issuing
host’s balance. Finally, first-contact connection requests are blocked if the host
does not have any credit available ($C_l = 0$).^3

^3 In Section 8, we discuss the alternative of allowing all TCP requests to be transmitted
and queueing responses until credits are available.

Event              Change to $C_l$
-----------------  ---------------------------------------------------------
Starting balance   $C_l \leftarrow 10$
FCC issued by $l$  $C_l \leftarrow C_l - 1$
FCC succeeds       $C_l \leftarrow C_l + 2$
Every second       $C_l \leftarrow \max(10, \frac{2}{3} C_l)$ if $C_l > 10$
Allowance          $C_l \leftarrow 1$ if $C_l = 0$ for 4 seconds

Fig. 7. The underlying equations behind credit-based connection rate limiting. Changes
to a host’s balance are triggered by the first-contact connections (FCCs) it initiates
and by the passing of time
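
The rules summarized in Figure 7 can be sketched as a small class. The class and method names are hypothetical, and two details are assumptions of this sketch rather than statements from the text: balances are kept as integers with inflation rounded down, and "for 4 seconds" is counted as four consecutive one-second ticks at a zero balance.

```python
class CreditLimiter:
    """Credit balance for one local host (illustrative sketch)."""

    def __init__(self):
        self.credits = 10           # starting balance
        self.seconds_at_zero = 0    # consecutive ticks spent with no credit

    def request_fcc(self):
        """Return True if a first-contact connection may be sent now."""
        if self.credits == 0:
            return False            # blocked: no credit available
        self.credits -= 1           # cost is paid when the request is issued
        return True

    def fcc_succeeded(self):
        """A first-contact connection was acknowledged successfully."""
        self.credits += 2

    def tick(self):
        """Apply inflation and the allowance; call once per second."""
        if self.credits > 10:
            # Assumption: integer credits, rounded down, never below 10.
            self.credits = max(10, (2 * self.credits) // 3)
        if self.credits == 0:
            self.seconds_at_zero += 1
            if self.seconds_at_zero >= 4:
                self.credits = 1    # allowance for a starved host
                self.seconds_at_zero = 0
        else:
            self.seconds_at_zero = 0
```

A host whose first ten requests all go unanswered is blocked on the eleventh, and then receives a single credit after four ticks, which yields one further observation for the hypothesis test.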



If a host’s first-contact connection succeeds with probability $\theta$, the host’s expected
payoff from issuing that connection is its expected success credit minus its cost, or $2\theta - 1$.
This payoff is positive for $\theta > \frac{1}{2}$ and negative otherwise. Hosts that scan with
a low rate of successful connections will quickly consume their credits whereas
benign hosts that issue first-contact connections with high rates of success will
nearly double their credits each time they invest them.
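
As a worked illustration of this payoff, the success probability estimates used in Section 4 ($\theta_0 = 0.7$ for benign hosts and $\theta_1 = 0.1$ for scanners) can be substituted into the expression. These estimates are only being reused here as example inputs.

```latex
\begin{align*}
\mathrm{E}[\text{payoff}] &= 2\theta - 1 \\
\theta = \theta_0 = 0.7 &\;\Rightarrow\; 2(0.7) - 1 = +0.4 \text{ credits per request} \\
\theta = \theta_1 = 0.1 &\;\Rightarrow\; 2(0.1) - 1 = -0.8 \text{ credits per request}
\end{align*}
```

Under these estimates a scanner starting with ten credits would, on average, exhaust its balance after about $10 / 0.8 = 12.5$ first-contact requests, while a benign host's balance would tend to grow.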

As described so far, the algorithm could result in two undesirable states. 
First, a host could acquire a large number of credits while performing a benign 
activity (e.g. Web crawling) which could be used later by a scanning worm. 
Second, a network outage could cause a benign host to use all of its credits after 
which it would starve for a lack of first-contact connection successes. 

These problems are addressed by providing each host with a small allowance
and by putting in place a high rate of inflation. If a host has been without credits
for four seconds, we issue the host a single credit ($C_l \leftarrow 1$ if $C_l = 0$). This not
only ensures that the host does not starve, but enables us to collect another
observation to feed into our hypothesis test ($\overleftarrow{HT}$). Because $\overleftarrow{HT}$, as configured in
Section 4, observes all first-contact connection requests as successes or failures
within three seconds, providing a starving process with a credit allowance only
after more than three seconds have passed ensures that $\overleftarrow{HT}$ will have been
executed on all previously issued first-contact connection requests. If $\overleftarrow{HT}$ has
already concluded that the host is a worm, it is expected that the system will
be quarantined and so no requests will reach their destination regardless of the
credit balance.

For each second that passes, a host that has acquired more than 10 credits
will be forced to surrender up to a third of them, but not so many as to take
its balance below 10 ($C_l \leftarrow \max(10, \frac{2}{3} C_l)$ if $C_l > 10$). A host that is subject to
the maximum inflation rate, with a first-contact connection rate $r$, success rate
$\theta > 0$, and credit balance $C_{l,t}$ at time $t$, will see this balance reach an equilibrium
state $C$ when $C = C_{l,t} = C_{l,t+1}$.

$$C_{l,t+1} = \tfrac{2}{3}\left(C_{l,t} + r\,(2\theta - 1)\right)$$

$$C = \tfrac{2}{3}\left(C + r\,(2\theta - 1)\right)$$

$$C = \tfrac{2}{3}\,C + \tfrac{2}{3}\,r\,(2\theta - 1)$$

$$C = 2\,r\,(2\theta - 1)$$

One can now see that we chose the inflation constant $\frac{2}{3}$ to ensure that, in the
upcoming second, a host that has a perfect first-contact connection success rate
($\theta = 1$) will have twice as many credits as it could have needed in the previous
second. Also note that the maximum inflation rate, which seems quite steep,
is only fully applied when $C \ge 15$, which in turn occurs only when the
first-contact connection rate $r$ is greater than 7.5 requests per second. Twycross and
Williamson’s virus throttle, on the other hand, can only assume that any host
with a first-contact connection rate consistently greater than one request per
second is a worm.
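
The equilibrium can be checked numerically by iterating the per-second recurrence. The sketch below uses a hypothetical function name, treats the balance as a real number, and models only the average credit flow for a host with a non-negative expected payoff (it does not model blocking at a zero balance).

```python
def equilibrium_balance(r, theta, seconds=200):
    """Iterate the per-second credit recurrence and return the final balance.

    r     -- first-contact connections issued per second
    theta -- probability that a first-contact connection succeeds
    """
    c = 10.0                                # starting balance
    for _ in range(seconds):
        c += r * (2 * theta - 1)            # expected net credits earned in one second
        if c > 10:
            c = max(10.0, 2.0 * c / 3.0)    # inflation, never taking the balance below 10
    return c
```

With a perfect success rate, a host issuing 10 requests per second settles at $2 \cdot 10 \cdot (2 - 1) = 20$ credits, while a host issuing 2 requests per second is held at the floor of 10.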

The constant of 10 was chosen for the starting credit balance (and for the
equilibrium minimum credit balance for benign hosts with first-contact connection
rates below 5 requests/second) in order to match the requirements of our
sequential hypothesis test ($\overleftarrow{HT}$) as currently configured (see parameters in
Section 4), which itself requires a minimum of 10 observations in order to conclude
that a host is engaged in scanning. Slowing the rate at which the first 10 observations
can be obtained will only delay the time required by $\overleftarrow{HT}$ to conclude that
a host is engaged in scanning. Should the parameters of $\overleftarrow{HT}$ be reconfigured and
the minimum number of observations required to conclude a host is a scanner
change, the starting credit balance for rate-limiting can be changed to match it.

4 Experimental Setup 

We evaluated our algorithms using two traces collected at the peering link of a
medium-sized ISP: one collected in April 2003 (isp-03) containing 404 active
hosts and the other in January 2004 (isp-04) containing 451 active hosts. These
traces, summarized in Table 1, were collected using tcpdump.

Obtaining usable traces was quite difficult. Due to privacy concerns, network 
administrators are particularly loath to share traces, let alone those that contain
payload data in addition to headers. Yet, we required the payload data in order 
to manually determine which, if any, worm was present on a host that was flagged 
as infected. 

To best simulate use of our algorithm in a worm detection system that is 
used to quarantine hosts, we only tested local hosts for infection. Remote hosts 
were not tested. 

In configuring our reverse sequential hypothesis test 0TT), first-contact con- 
nection requests were interpreted as failures if they were not acknowledged within 
a three second grace period. First-contact connection requests for which TCP 
RST packets were received in response were immediately reported as failure 
observations. Connection success probability estimates were chosen to be: 

$\theta_0 = 0.7 \qquad \theta_1 = 0.1$

Table 1. Summary of network traces

                                        isp-03         isp-04
  Date                                  2003/04/10     2004/01/28
  Duration                              627 minutes    66 minutes
  Total outbound connection attempts    1,402,178      178,518
  Total active local hosts              404            451

Confidence requirements were set to: 

$\alpha = 0.00005 \qquad \beta = 0.99$

Note that these confidence requirements are for each reverse sequential hy- 
pothesis test, and that a test is performed for each first-contact connection that 
is observed. Therefore, the false positive rate is chosen to be particularly low as 
testing will occur many times for each host. 
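As a rough check of how these parameters interact, the sketch below derives the two decision thresholds and counts how many consecutive failed first contacts are needed to cross the upper one. It assumes the usual Wald approximations for the thresholds (beta/alpha and (1 - beta)/(1 - alpha)) and the usual per-observation likelihood ratio; the names are illustrative and are not taken from the Perl implementation.

```python
THETA_0 = 0.7      # P(first contact succeeds | host is benign)
THETA_1 = 0.1      # P(first contact succeeds | host is infected)
ALPHA = 0.00005    # tolerated false positive probability per test
BETA = 0.99        # required detection probability per test

# Wald's threshold approximations (an assumption of this sketch).
ETA_1 = BETA / ALPHA              # conclude "infected" at or above this value
ETA_0 = (1 - BETA) / (1 - ALPHA)  # conclude "benign" at or below this value


def phi(success: bool) -> float:
    """Likelihood ratio contributed by a single first-contact observation."""
    if success:
        return THETA_1 / THETA_0
    return (1 - THETA_1) / (1 - THETA_0)


def failures_needed() -> int:
    """Count the consecutive failures needed to move the ratio from 1 to ETA_1."""
    ratio, count = 1.0, 0
    while ratio < ETA_1:
        ratio *= phi(False)
        count += 1
    return count


print(failures_needed())  # prints 10
```

With these values each failure roughly triples the ratio and each success divides it by seven. Since 3^9 = 19683 falls just short of the upper threshold of about 19800, ten consecutive failures are required, which agrees with the minimum of 10 observations mentioned at the end of Section 3.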

For each local host we maintained a Previously Contacted Host (PCH) set of 
only the last 64 destination addresses that each local host had communicated 
with (LRU replacement). For the sake of the experiment, a first-contact connec- 
tion request was any TCP SYN packet or UDP packet addressed to a host that 
was not in the local host’s PCH set. While using a fixed-size PCH set demonstrates
the efficacy of our test under the memory constraints that are likely to occur 
when observing large (e.g. class B) networks, this fixed memory usage comes at 
a cost. As described in Section 6, it is possible for a worm to exploit limitations 
in the PCH set size in order to avoid having its scans detected. 
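The PCH bookkeeping described above can be sketched as follows. This is a minimal illustration that assumes destinations are represented as strings and uses an ordered dictionary for the LRU order; the class and method names are hypothetical.

```python
from collections import OrderedDict


class PCHSet:
    """Fixed-size set of previously contacted hosts with LRU replacement."""

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self._hosts = OrderedDict()  # least recently used entry comes first

    def is_first_contact(self, dest: str) -> bool:
        """Record a contact with dest; return True if dest was not in the set."""
        if dest in self._hosts:
            self._hosts.move_to_end(dest)    # refresh recency of a known host
            return False
        self._hosts[dest] = None
        if len(self._hosts) > self.capacity:
            self._hosts.popitem(last=False)  # evict the least recently used host
        return True


# Tiny capacity so that the eviction is visible.
pch = PCHSet(capacity=2)
results = [pch.is_first_contact(d) for d in ["a", "b", "a", "c", "b"]]
print(results)  # [True, True, False, True, True]
```

The final contact with "b" is reported as a first contact again because "b" was evicted when "c" arrived. This is the memory-for-accuracy trade-off that Section 6 returns to.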

For sake of comparison, we also implemented Twycross and Williamson’s 
‘virus throttle’ as described in [22]. Since our traces contain only those packets 
seen at the peering point, our results may differ from a virus throttle imple- 
mented at each local host as Twycross and Williamson recommend. However, 
because observing connections farther from the host results in a reduction in 
the number of connections observed, it should only act to reduce the reported 
number of false positives in which benign behavior is throttled. 

All algorithms were implemented in Perl, and used traces that had been 
pre-processed by the Bro Network Intrusion Detection System [13, 12]. 

We did not observe FTP-DATA, finger, and IDENT connections as these con- 
nections are the result of local hosts responding to remote hosts, and are not 
likely to be accepted by a host that has not issued a request for such a connection. 
These connections are thus unlikely to be useful for worm propagation. 



5 Results 

Our reverse sequential hypothesis test detected two hosts infected with CodeRed 
II [4, 20] from the April, 2003 trace (isp-03). Our test detected one host infected 
with Blaster/Lovsan [5], three hosts infected with MyDoom/Novarg [11,21], and 
one host infected with Minmail.j [6] from the January, 2004 trace (isp-04).

Table 2. Alarms reported by reverse sequential hypothesis testing combined with
credit-based rate limiting. The cause of each alarm was later identified manually by
comparing observed traffic to signature behaviors described at online virus libraries

                               isp-03    isp-04
  Worms/Scanners detected
    CodeRed II                 2         0
    Blaster                    0         1
    MyDoom                     0         3
    Minmail.j                  0         1
    HTTP (other)               3         1
    Total                      5         6
  False alarms
    HTTP                       0         3
    SMTP                       0         3
    Total                      0         6
  P2P detected                 6         11
  Total identified             11        23

Table 3. Alarms reported by virus throttling

                               isp-03    isp-04
  Worms/Scanners detected
    CodeRed II                 2         0
    MyDoom                     0         1
    HTTP (other)               1         1
    Total                      3         2
  False alarms                 0         0
  P2P detected                 2         3
  Total identified             5         5

The worms were conclusively identified by painstakingly comparing the logged 
traffic with the cited worm descriptions at various online virus/worm information
libraries. Our test also identified four additional hosts that we classify as HTTP 
scanners because each sent SYN packets to port 80 of at least 290 addresses 
within a single class B network. These results are summarized in Table 2. 

While peer-to-peer applications are not necessarily malicious, many network 
administrators would be loath to classify them as benign. Peer-to-peer file sharing
applications also exhibit ambiguous network behavior, as they attempt to
contact a large number of transient peers that are often unwilling or unavailable 
to respond to connection requests. While peer-to-peer clients are deemed unde- 
sirable on most of the corporate networks that we envision our approach being 
used to protect, it would be unfair to classify these hosts as infected. For this 
reason we place hosts that we detect running peer-to-peer applications into their 
own category. Even if detections of these hosts are classified as false alarms, the 
number of alarms is manageable.

Table 4. Composite results for both traces. A total of 7 HTTP scanning worms and
5 email worms were present

                          Alarms    Detection    Efficiency    Effectiveness
  $\overleftarrow{HT}$    34        11           0.324         0.917
  virus-throttling        10        5            0.500         0.417


Table 5. Comparison of rate limiting by credit-based connection rate limiting
(CBCRL) vs. a virus throttle. Unnecessary rate limiting means that CBCRL dropped
at least one packet from a host. For virus throttling, we only classify a host as rate
limited if the delay queue reaches a length greater than five

                               CBCRL                Virus Throttling
                               isp-03    isp-04     isp-03    isp-04
  Worms/Scanners               5         1          3         4
  P2P                          4         8          3         7
  Unnecessary rate limiting    0         0          84        59


Three additional false alarms were reported for three of the 60 (isp-04) total 
hosts transmitting SMTP traffic. We suspect the false alarms are the result of 
bulk retransmission of those emails that have previously failed when the recipi- 
ents’ mail servers were unreachable. We suggest that organizations may want to 
white-list their SMTP servers, or significantly increase the detection thresholds 
for this protocol. 

The remaining three false alarms are specific to the isp-04 trace, and resulted 
from HTTP traffic. It appears that these false alarms were raised because of 
a temporary outage at a destination network at which multiple remote hosts 
became unresponsive. These may have included servers used to serve inlined 
images. 

Upon discovering these failures, we came to realize that it would be possible 
for an adversary to create Web sites that served pages with large numbers of 
inlined image tags linked to non-responsive addresses. If embedded with scripts, 
these sites might even be designed to perform scanning of the client’s network 
from the server. Regardless, any client visiting such a site would appear to be 
engaged in HTTP scanning. To prevent such denial of service attacks from ren- 
dering a worm detection system unusable, we require a mechanism for enabling 
users to deactivate quarantines triggered by HTTP requests. We propose that 
HTTP requests from such hosts be redirected to a site that uses a CAPTCHA 
(Completely Automated Public Turing Test to Tell Computers and Humans 
Apart [23]), to confirm that a user is present and was using a Web browser at 
the time of quarantine. 

Results for our implementation of Twycross and Williamson’s virus throt- 
tle [22] are summarized in Table 3. Their algorithm blocked both instances of 
CodeRed II, but failed to detect Blaster, three instances of MyDoom (which is 
admittedly an email worm and not an IP scanning worm), and two low rate 
HTTP scanners. It did, however, detect one host infected with MyDoom that
$\overleftarrow{HT}$ failed to detect. The virus throttle also detected fewer hosts
running peer-to-peer applications, which for fairness we classify as a reduction in
false alarms in virus throttling’s favor in our composite results summarized in
Table 4.

Table 6. The number of first-contact connections permitted before hosts were reported
as infected. The value pairs represent individual results for two different CodeRed II
infections and two different HTTP scanners

                          $\overleftarrow{HT}$ with CBCRL    Virus Throttling
  CodeRed II              10, 10                             6, 7
  Other HTTP scanners     10, 10                             102, 526

These composite results for both traces report the number of hosts that
resulted in alarms and the number of those alarms that were detections of the 12
worms located in our traces. We also include the efficiency, which is the number
of detections over the total number of alarms, and the effectiveness, which is
the total number of detections over the total number of infected hosts we have
found in these traces. While $\overleftarrow{HT}$ is somewhat less efficient than
virus throttling, the more than two-fold increase in effectiveness is well worth
the trade-off. In addition, corporate networks that forbid peer-to-peer file
sharing applications will see a two-fold increase in efficiency.
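The two ratios can be reproduced directly from the counts in Table 4. The short calculation below is only an arithmetic check; the function and variable names are illustrative.

```python
INFECTED_HOSTS = 12  # 7 HTTP scanning worms + 5 email worms (Table 4 caption)

# (alarms, detections) per approach, copied from Table 4.
COUNTS = {
    "reverse sequential hypothesis test": (34, 11),
    "virus throttling": (10, 5),
}


def efficiency(detections: int, alarms: int) -> float:
    """Fraction of alarms that were detections of worms."""
    return detections / alarms


def effectiveness(detections: int, infected: int) -> float:
    """Fraction of the infected hosts that were detected."""
    return detections / infected


for name, (alarms, detections) in COUNTS.items():
    eff = efficiency(detections, alarms)
    efc = effectiveness(detections, INFECTED_HOSTS)
    print(f"{name}: efficiency={eff:.3f} effectiveness={efc:.3f}")
# reverse sequential hypothesis test: efficiency=0.324 effectiveness=0.917
# virus throttling: efficiency=0.500 effectiveness=0.417
```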

Table 5 shows the number of hosts that had connection requests blocked by 
our credit-based algorithm and the number of hosts that were rate limited by 
Twycross and Williamson’s algorithm. For credit-based connection rate limiting, 
we say that a machine has been rate limited if a single packet is dropped. For 
the virus throttle, we say that a machine has been rate limited if the outgoing 
delay queue length is greater than five, giving Twycross and Williamson the 
benefit of the doubt that users won’t notice unless connections are severely 
throttled. Our credit-based algorithm only limited the rates of hosts that our 
reverse sequential hypothesis test reported as infected. In contrast, even given 
our generous definition, more than 10% of the hosts in both traces were rate 
limited by Twycross and Williamson’s algorithm. 

Table 6 reports the number of first-contact connections permitted by the two 
approaches for those scanners that both detected. CodeRed II is a fast scanner, 
and so virus throttling excels in blocking it after 6 to 7 connection requests. 
This speed is expected to come at the price of detecting any service, malicious 
or benign, that issues high-rate first-contact connections. 

Reverse Sequential Hypothesis Testing with credit-based connection rate lim- 
iting detects worms after a somewhat higher number of first-contact connections 
are permitted (10), but does so regardless of the scanning rate. Whereas our ap- 
proach detects a slow HTTP scanner after 10 first-contact connection requests, 
the virus throttle requires as many as 526. 

6 Limitations 

Credit-based connection rate limiting is resilient to network uplink outages as 
hosts starved for credits will receive an allowance credit seconds after the network 
is repaired. Unfortunately, this will be of little consolation as Reverse Sequential 
Hypothesis Testing ($\overleftarrow{HT}$) may have already concluded that all hosts are scanners.
This may not be a problem if network administrators are given the power to
invalidate observations made during the outage period, and to automatically 
reverse any quarantining decisions that would not have been taken without these 
invalid observations. 

Of greater concern is that both Reverse Sequential Hypothesis Testing and 
credit-based connection rate limiting rely exclusively on the observation that 
hosts engaged in scanning will have lower first-contact connection success rates 
than benign hosts. New hypotheses and tests are required to detect worms for 
which this statistical relationship does not hold. 

In particular, our approach is not likely to detect a topological worm, which 
scans for new victim hosts by generating a list of addresses that the infected host 
has already contacted. Nor is our approach likely to detect flash worms, which 
contain hit-lists of susceptible host addresses identified by earlier scans. 

Also problematic is that two instances of a worm on different networks could 
collaborate to ensure that none of their first-contact connections will appear 
to fail. For example, if worm A does not receive a response to a first-contact 
connection request after half the timeout period, it could send a message to 
worm B asking it to forge a connection response. This forged response attack 
prevents our system from detecting connection failures. To thwart this attack 
for TCP connections, a worm detection system implemented on a router can 
modify the TCP sequence numbers of traffic as it enters and leaves the network. 
For example, the result of a hash function $h(IP_{local}, IP_{remote}, salt)$ may be added
to all sequence numbers on outgoing traffic and subtracted from all incoming 
sequence numbers. The use of the secret salt prevents the infected hosts from 
calculating the sequence number used to respond to a connection request which 
they have sent, but not received. By storing the correct sequence number in the 
FCC queue, responses can then be validated by the worm detection system. 
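The sequence number translation described above might look like the following sketch. It assumes HMAC-SHA256 as the keyed hash and truncates the digest to 32 bits; both choices, and all function names, are illustrative rather than part of the proposal. In a real TCP exchange the value echoed by the remote host arrives in the acknowledgement field, so the incoming translation would be applied there; that reading is also an assumption.

```python
import hashlib
import hmac
import ipaddress

SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def seq_offset(local_ip: str, remote_ip: str, salt: bytes) -> int:
    """Derive a per-address-pair offset from a hash keyed with the secret salt."""
    msg = (ipaddress.ip_address(local_ip).packed
           + ipaddress.ip_address(remote_ip).packed)
    digest = hmac.new(salt, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def translate_outgoing(seq: int, local_ip: str, remote_ip: str, salt: bytes) -> int:
    """Add the secret offset to a number leaving the network."""
    return (seq + seq_offset(local_ip, remote_ip, salt)) % SEQ_SPACE


def translate_incoming(seq: int, local_ip: str, remote_ip: str, salt: bytes) -> int:
    """Subtract the secret offset from a number entering the network."""
    return (seq - seq_offset(local_ip, remote_ip, salt)) % SEQ_SPACE


# Round trip with hypothetical addresses and a value close to the wrap-around point.
salt = b"router-secret"
original = 4294967290
on_wire = translate_outgoing(original, "10.0.0.5", "192.0.2.7", salt)
restored = translate_incoming(on_wire, "10.0.0.5", "192.0.2.7", salt)
assert restored == original
```

A collaborating worm that never saw the translated request cannot compute `on_wire` without the salt, so a forged response would carry a value that fails validation against the number stored in the FCC queue.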

Another concern is the possibility that a worm could arrive at its target al- 
ready in possession of a list of known repliers - hosts that are known to reply 
to connection requests at a given port. This known-replier attack could em- 
ploy lists that are programmed into the worm at creation, or accumulated by 
the worm as it spreads through the network. First-contact connections to these 
known-repliers will be very likely to succeed and can be interleaved with scans 
to raise the first-contact connection success rate. A one to one interleaving is 
likely to ensure that more than half of all connections succeed. This success rate 
would enable the scanner to bypass credit-based connection rate limiting, and 
delay detection by Reverse Sequential Hypothesis Testing until the scanner had 
contacted all of its known-repliers. What’s worse, a worm could avoid detection 
altogether if the detection system defines a first-contact connection with respect 
to a fixed-size previously contacted host (PCH) set. If the PCH set tracks only the n
previously visited hosts, the scanner can cycle through (n/2) + 1 known-repliers,
interleaved with as many new addresses, and never be detected. (For detecting such
a worm, a random replacement policy will be superior to an LRU replacement policy,
but will still not be effective enough for long known-replier lists.) To prevent a
worm from scanning your local network by interleaving connections to known-repliers
outside of your network, Weaver et al. [26] propose that one hypothesis
test be run for local connections (i.e. those within the same IP block) and another
for connections to remote hosts. If hosts in your local network are widely
and randomly dispersed through a large IP space*, then a worm will have a low
probability of finding another host to infect before being quarantined.
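The known-replier evasion against an LRU-managed PCH set can be checked with a small simulation. The sketch assumes that every contact with a known replier succeeds, that every scan of a fresh address fails, and that the detector keeps a running likelihood ratio floored at 1 (the optimization of Appendix A) with the Section 4 parameters. All names are illustrative.

```python
from collections import OrderedDict
from itertools import count

ETA_1 = 0.99 / 0.00005   # upper threshold, beta / alpha (assumed Wald form)
PHI_SUCCESS = 0.1 / 0.7  # theta_1 / theta_0
PHI_FAILURE = 0.9 / 0.3  # (1 - theta_1) / (1 - theta_0)


def detected(pch_size: int, known_repliers: int, rounds: int = 10_000) -> bool:
    """Simulate a scanner that alternates known repliers with fresh addresses."""
    pch = OrderedDict()  # LRU set of previously contacted hosts
    ratio = 1.0          # running likelihood ratio, never allowed below 1
    fresh = count()      # supplies scan targets that were never seen before

    def first_contact(dest) -> bool:
        if dest in pch:
            pch.move_to_end(dest)
            return False
        pch[dest] = None
        if len(pch) > pch_size:
            pch.popitem(last=False)
        return True

    for i in range(rounds):
        replier = ("replier", i % known_repliers)
        if first_contact(replier):            # replier answers: a success
            ratio = max(1.0, ratio * PHI_SUCCESS)
        first_contact(("scan", next(fresh)))  # fresh address: always a first contact
        ratio = max(1.0, ratio * PHI_FAILURE) # assumed to fail
        if ratio >= ETA_1:
            return True
    return False


print(detected(pch_size=64, known_repliers=33))  # False
print(detected(pch_size=64, known_repliers=32))  # True
```

With 33 known repliers and a 64-entry set, each replier has been evicted by the time it is revisited, so every scan failure is offset by a first-contact success and the ratio never climbs past 3. With only 32 repliers they stay resident, stop counting as first contacts, and the scanner is flagged after ten consecutive failures.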

A worm might also avoid detection by interleaving scanning with other ap- 
parently benign behavior, such as Web crawling. A subset of these benign inter- 
leaving attacks can be prevented by detecting scanners based on the destination 
port they target in addition to the source IP of the local host. While it is still 
fairly easy to create benign looking traffic for ports such as HTTP, for which 
one connection can lead to information about other active hosts receptive to new 
connections, this is not the case for ports such as those used by SSH. Running 
separate scan detection tests for each destination port that a local host addresses 
can ensure that connections to one service aren’t used to mask scans to other 
services. 
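One way to run a separate test per destination port is to key the detector state on the pair (source address, destination port). The sketch below assumes the running-ratio form of the test from Appendix A and the Section 4 parameters; the class name and the example addresses are hypothetical.

```python
from collections import defaultdict

ETA_1 = 0.99 / 0.00005   # upper threshold, beta / alpha (assumed Wald form)
PHI_SUCCESS = 0.1 / 0.7  # theta_1 / theta_0
PHI_FAILURE = 0.9 / 0.3  # (1 - theta_1) / (1 - theta_0)


class PerPortDetector:
    """Keeps one running likelihood ratio per (source IP, destination port)."""

    def __init__(self):
        self._ratio = defaultdict(lambda: 1.0)

    def observe(self, src_ip: str, dst_port: int, success: bool) -> bool:
        """Fold in one first-contact outcome; return True if the pair is flagged."""
        key = (src_ip, dst_port)
        step = PHI_SUCCESS if success else PHI_FAILURE
        self._ratio[key] = max(1.0, self._ratio[key] * step)
        return self._ratio[key] >= ETA_1


# A host interleaves successful Web requests with failing SSH probes.
detector = PerPortDetector()
ssh_alarms = []
for _ in range(10):
    detector.observe("10.0.0.5", 80, success=True)
    ssh_alarms.append(detector.observe("10.0.0.5", 22, success=False))

print(ssh_alarms[-1])  # True: the tenth SSH failure raises the alarm
```

Because the port 80 successes update a different state entry, they do not dilute the evidence accumulated against port 22. The cost is one state entry per active pair, which is the memory concern raised in Section 8.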

Finally, if an infected host can impersonate other hosts, the host could es- 
cape quarantine and cause other (benign) hosts to be quarantined. To address 
these address impersonation attacks, it is important that a complete system for 
network quarantining include strong methods for preventing IP masquerading 
by its local hosts, such as switch level egress filtering. Host quarantining should 
also be enforced as close to the host as is possible without relying on the host 
to quarantine itself. If these boundaries cannot be enforced between each host, 
one must assume that when one machine is infected, all of the machines within 
the same boundary will also be infected. 

7 Related Work 

We were motivated by the work of Moore et al. [10], who model attempts at con- 
taining worms using quarantining. They perform theoretical simulations, many 
of which use parameters principally from the CodeRed II [4, 20] outbreak. They 
argue that it is impossible to prevent systems from being vulnerable to worms 
and that treatment cannot be performed fast enough to prevent worms from 
spreading, leaving containment (quarantining) as the most viable way to pre- 
vent worm outbreaks from becoming epidemics. 

Early work on containment includes Staniford et al.'s work on the GrIDS 
Intrusion Detection System [19], which advocates the detection of worms and 
viruses by tracing their paths through the departments of an organization. More 
recently, Staniford [16] has worked to generalize these concepts by extending 
models for the spread of infinite-speed, random scanning worms through
homogeneous networks divided up into ‘cells’. Simulating networks with $2^{17}$ hosts
(two class B networks), Staniford limits the number of first-contact connections
that a local host initiates to a given destination port to a threshold, T. While
he claims that for most ports, a threshold of T = 10 is achievable in practice,
HTTP and KaZaA are exceptions. In comparison, reverse sequential hypothesis
testing reliably identifies HTTP scanning in as few as 10 observations.

* Randomly dispersing local hosts through a large IP space can be achieved by using
a network address translation (NAT) switch.

The TRAFEN [2, 3] system also observed failed connections for the purpose 
of identifying worms. The system was able to observe larger networks, without 
access to end-points, by inferring connection failures from ICMP messages. One 
problem with acting on information at this level is that an attacker could spoof 
source IP addresses to cause other hosts to be quarantined. 

Our use of rate limiting in order to buy time to observe worm behavior was 
inspired by the virus throttle presented by Twycross and Williamson [22], which 
we described in detail in Section 3. Worms can evade a throttle by scanning at 
rates below one connection per second, allowing epidemics to double in size as 
quickly as once every two seconds. 

An approach quite similar to our own has been simultaneously developed by 
Weaver, Staniford, and Paxson [26]. Their approach combines the rate limiting 
and hypothesis testing steps by using a reverse sequential hypothesis test that 
(like our CBCRL algorithm) assumes that connections fail until they are proven 
to succeed. As with CBCRL, out-of-order processing could cause a slight increase 
in detection delay, as the successes of connections sent before an infection event 
may be processed after connections are initiated after the infection event. In the 
context of their work, in which the high-performance required to monitor large 
networks is a key goal, the performance benefits are likely to outweigh the slight 
cost in detection speed. 

For a history and recent trends in worm evolution, we recommend the work
of Kienzle and Elder [8]. For a taxonomy of worms and a review of worm
terminology, see Weaver et al. [25].

8 Future Work 

As worm authors become aware of the limitations discussed in Section 6, it will 
be necessary to revise our algorithms to detect scanning at the resolution of 
the local host (source address) and targeted service (destination port), rather 
than looking at the source host alone. Solutions for managing the added memory 
requirements imposed by this approach have been explored by Weaver, Staniford, 
and Paxson [26]. 

The intrusiveness of credit-based connection rate limiting, which currently 
drops outgoing connection requests when credit balances reach zero, can be fur- 
ther reduced. Instead of halting outgoing TCP first-contact connection requests 
from hosts that do not maintain a positive credit balance, the requests can be sent 
immediately and the responses held until a positive credit balance is achieved. 
This improvement has the combined benefits of reducing the delays caused by 
false rate limiting while simultaneously ensuring that fewer connections are al- 
lowed to complete when a high-speed scanning worm issues a burst of connection 
requests. As a result, the remaining gap in response speed between credit-based 
connection rate limiting and Twycross and Williamson’s virus throttle can be 
closed while further decreasing the risk of acting on false positives. 

Finally, we would like to employ additional indicators of infection to further 
reduce the number of first-contact connection observations required to detect a
worm. For example, it is reasonable to conclude that, when a host is deemed
to be infected, those hosts to which it has most recently initiated successful 
connections are themselves more likely to be infected (as was the premise behind 
GrIDS [19]). We propose that this be accomplished by adding an event type, the 
report of an infection of a host that has recently contacted the current host, to 
our existing hypothesis test. 

9 Conclusion 

When combined, credit-based connection rate limiting and reverse sequential 
hypothesis testing ensure that worms are quickly identified with an attractively 
low false alarm rate. While no system can detect all possible worms, our new 
approach is a significant improvement over prior methods, which detect a smaller 
range of scanners and unnecessarily delay network traffic. What’s more, the 
techniques introduced in this paper lend themselves to efficient implementation, 
as they need only be activated to observe a small subset of network events and 
require little calculation for the common case that traffic is benign.

A Optimizing the Computation of Repeated Reverse Sequential Hypothesis Tests

It is unnecessarily expensive to repeatedly recompute $\Lambda$ in reverse sequence
each time a new first-contact connection is observed. A significant optimization
requires that we maintain a single state variable $\bar{\Lambda}$, calculated
iteratively in the order in which events are observed.

$$\bar{\Lambda}(Y_n) = \max\bigl(1,\ \bar{\Lambda}(Y_{n-1})\,\phi(Y_n)\bigr), \qquad \bar{\Lambda}(Y_0) = 1. \tag{1}$$

We will prove that $\bar{\Lambda}(Y_n) \ge \eta_1$ if and only if a reverse sequential
hypothesis test starting backward from observation $n$ would lead us to conclude
that the host was infected.
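Before the formal argument, the claimed equivalence can be checked empirically. The sketch below compares the first alarm raised by the running state variable with the first alarm raised by a brute-force reverse test that walks backward from each new observation. It uses exact rational arithmetic to avoid rounding effects at the thresholds and assumes the Section 4 parameters with Wald-style thresholds; the function names are illustrative.

```python
import random
from fractions import Fraction

THETA_0, THETA_1 = Fraction(7, 10), Fraction(1, 10)
ALPHA, BETA = Fraction(5, 100000), Fraction(99, 100)
ETA_1 = BETA / ALPHA              # upper threshold (assumed Wald form)
ETA_0 = (1 - BETA) / (1 - ALPHA)  # lower threshold (assumed Wald form)


def phi(success: bool) -> Fraction:
    """Likelihood ratio of one first-contact observation."""
    return THETA_1 / THETA_0 if success else (1 - THETA_1) / (1 - THETA_0)


def reverse_test_infected(history) -> bool:
    """Brute force: multiply backward from the newest observation."""
    ratio = Fraction(1)
    for success in reversed(history):
        ratio *= phi(success)
        if ratio >= ETA_1:
            return True
        if ratio <= ETA_0:
            return False
    return False  # observations exhausted without reaching a conclusion


def first_alarm_brute_force(observations):
    """Index of the first observation at which the reverse test reports infection."""
    for n in range(1, len(observations) + 1):
        if reverse_test_infected(observations[:n]):
            return n
    return None


def first_alarm_optimized(observations):
    """Same question, answered with the single state variable of Equation (1)."""
    ratio = Fraction(1)
    for n, success in enumerate(observations, start=1):
        ratio = max(Fraction(1), ratio * phi(success))
        if ratio >= ETA_1:
            return n
    return None


rng = random.Random(2004)
for _ in range(200):
    trace = [rng.random() < 0.3 for _ in range(40)]  # True means success
    assert first_alarm_optimized(trace) == first_alarm_brute_force(trace)
```

The comparison stops at the first alarm because Lemma 2 below only covers the first observation at which the state variable reaches the upper threshold.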

We first prove the following lemma stating that if a reverse sequential
hypothesis test reports an infection, our optimized algorithm will also report an
infection.

Lemma 1. For $\eta_1 > 1$ and for mutually independent random variables $Y_i$,

$$\forall m \in [1, n]: \quad \Lambda(Y_n, Y_{n-1}, \ldots, Y_m) \ge \eta_1 \;\Rightarrow\; \bar{\Lambda}(Y_n) \ge \eta_1 \tag{2}$$

Proof. We begin by replacing the $\Lambda$ term with its equivalent expression in terms
of $\phi$:

$$\eta_1 \le \Lambda(Y_n, Y_{n-1}, \ldots, Y_m) \tag{3}$$

$$\phantom{\eta_1} \le \prod_{i=m}^{n} \phi(Y_i) \tag{4}$$

We can place a lower bound on the value of $\bar{\Lambda}(Y_n)$ by exploiting the fact
that, in any iteration, $\bar{\Lambda}$ cannot return a value less than 1.

$$\bar{\Lambda}(Y_n) = \bar{\Lambda}(Y_1, Y_2, \ldots, Y_n) \ge 1 \cdot \bar{\Lambda}(Y_m, Y_{m+1}, \ldots, Y_n) \ge \prod_{i=m}^{n} \phi(Y_i) \ge \eta_1$$

where the last inequality follows the steps taken in Equations (3) and (4).

Thus, $\Lambda(Y_n, Y_{n-1}, \ldots, Y_m) \ge \eta_1 \Rightarrow \bar{\Lambda}(Y_n) \ge \eta_1$.

We must also prove that our optimized algorithm will only report an infection
when a reverse sequential hypothesis test would also report an infection. Recall
that a reverse sequential hypothesis test will only report an infection if $\Lambda$
exceeds $\eta_1$ before falling below $\eta_0$.

Lemma 2. For thresholds $\eta_0 < 1 < \eta_1$ and for mutually independent random
variables $Y_i$, if $\bar{\Lambda}(Y_i) \ge \eta_1$ for some $i = n$, but
$\bar{\Lambda}(Y_i) < \eta_1$ for all $i \in [1, n-1]$, then there exists a subsequence
of observations starting at observation $n$ and moving backward to observation
$m \in [1, n]$ for which $\Lambda(Y_n, Y_{n-1}, \ldots, Y_m) \ge \eta_1$ and such that
there exists no $k$ in $[m, n]$ such that $\Lambda(Y_n, Y_{n-1}, \ldots, Y_k) \le \eta_0$.

Proof. Choose $m$ as the largest observation index for which it held that:

$$\bar{\Lambda}(Y_{m-2})\,\phi(Y_{m-1}) < 1$$

We know that $m \le n$ because $\bar{\Lambda}(Y_{n-1})\,\phi(Y_n)$ is at least $\eta_1$,
which is in turn greater than 1. Let $m = 1$ if the above relation does not hold for
any observation with index greater than 1. It follows that $\bar{\Lambda}(Y_{m-1}) = 1$
and thus:

$$\bar{\Lambda}(Y_m) = \phi(Y_m)$$

Because we chose $m$ such that $\bar{\Lambda}(Y_{j-2})\,\phi(Y_{j-1}) \ge 1$ for all $j > m$:

$$\bar{\Lambda}(Y_n) = \prod_{j=m}^{n} \phi(Y_j)$$

Thus, $\bar{\Lambda}(Y_n) \ge \eta_1 \Rightarrow \Lambda(Y_n, Y_{n-1}, \ldots, Y_m) \ge \eta_1$.

To prove that there exists no $k$ in $[m, n]$ such that
$\Lambda(Y_n, Y_{n-1}, \ldots, Y_k) \le \eta_0$, suppose that such a $k$ exists. It
follows that:

$$\prod_{j=k}^{n} \phi(Y_j) \le \eta_0 < 1 \tag{5}$$

Recall that we chose $m$ to ensure that:

$$\eta_1 \le \prod_{j=m}^{n} \phi(Y_j) \tag{6}$$

The product on the right hand side can be separated into factors from before
and after observation $k$.

$$\eta_1 \le \prod_{j=m}^{k-1} \phi(Y_j) \cdot \prod_{j=k}^{n} \phi(Y_j) \tag{7}$$

We then use Equation (5) to substitute an upper bound of 1 on the latter product.

$$\eta_1 < \prod_{j=m}^{k-1} \phi(Y_j)$$

$$\eta_1 < \bar{\Lambda}(Y_{k-1})$$

This contradicts the hypothesis that $\bar{\Lambda}(Y_i) < \eta_1$ for all $i \in [1, n-1]$.

If we were concerned with being notified when the test came to the ‘benign’
conclusion we could create an analogous function $\underline{\Lambda}$:

$$\underline{\Lambda}(Y_n) = \min\bigl(1,\ \underline{\Lambda}(Y_{n-1})\,\phi(Y_n)\bigr)$$

The lemmas required to show equivalence and proof are also analogous.

Abstract. The detection of unknown viruses is beyond the capability
of many existing virus detection approaches. In this paper, we show how 
proactive customization of system behaviors can be used to improve the 
detection rate of unknown malicious executables. Two general proac- 
tive methods, behavior skewing and cordoning, and their application in 
BESIDES, a prototype system that detects unknown massive mailing 
viruses, are presented. 

Keywords: virus detection; malicious executable detection; intrusion 
detection. 



1 Introduction 

Two major approaches employed by intrusion detection systems (IDSs) are 
misuse-based detection and anomaly-based detection. Because misuse-based de- 
tection relies on known signatures of intrusions, misuse-based IDSs cannot in 
general provide protection against new types of intrusions whose signatures are 
not yet cataloged. On the other hand, because anomaly-based IDSs rely on iden- 
tifying deviations from the “normal” profile of the protected system’s behavior, 
they are prone to reporting false-positives unless the normal profile of the pro- 
tected system is well characterized. 

In either case, detection must be performed as soon as an intrusion occurs 
and certainly before the intruder can cause harm and/or hide its own tracks. In 
this paper, we present the paradigm of PAIDS (ProActive Intrusion Detection 
System) and describe an application of PAIDS that can detect some classes of 
intrusions without knowing a priori their signatures and does so with very few
false positives. PAIDS also provides a way to dynamically trade off the time it
takes to detect an intrusion and the damage that an intruder can cause, and can
therefore tolerate intrusions to an extent. To achieve these advantages, PAIDS
exploits two general techniques: behavior skewing and cordoning.

Traditionally, the security of a computer system is captured by a set of secu- 
rity policies. A complete security policy should classify each behavior of a system 
as either legal or illegal. In practice, however, specifications of security policies 
often fail to scale [1] and are likely to be incomplete. Given a security policy, 
we can instead partition the set of all behaviors of a system into three subsets: 
(S1) Legal behaviors, (S2) Illegal behaviors and (S3) Unspecified behaviors,
corresponding to whether a behavior is consistent, inconsistent or independent of
the security policy. (Alternatively, think of a security policy as the axioms of a
theory for establishing the legality of system behaviors.) In this context, behav- 
ior skewing refers to the modification of a security policy P into P’ such that the 
subset of legal behaviors remains unchanged under P’, but some of the behaviors 
that are in the subset (S3) under P are illegal under P’. By implementing de- 
tectors that can recognize the enlarged subset (S2), behavior skewing can catch 
intruders that would otherwise be unnoticed under the unmodified policy P. The 
speed of detection depends on the length of the prefix of the illegal behavior that 
distinguishes it from the legal behaviors. To contain the damage caused by an 
intrusion, we seek ways to isolate the system components that are engaged in the 
illegal behavior, hopefully until we can distinguish it from the legal behaviors. 
By virtualizing the environment external to the system components that are 
engaged in the illegal behavior, cordoning prevents the system from permanent 
damage until the illegal behavior can be recognized. 
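The effect of behavior skewing on the three subsets can be illustrated with a toy model. The sketch below treats a policy as a mapping from behavior names to verdicts, with anything unmapped falling into (S3). The behavior names and helper functions are hypothetical and are not drawn from BESIDES.

```python
from enum import Enum


class Verdict(Enum):
    LEGAL = "S1"
    ILLEGAL = "S2"
    UNSPECIFIED = "S3"


def classify(policy: dict, behavior: str) -> Verdict:
    """Behaviors the policy does not mention are unspecified (S3)."""
    return policy.get(behavior, Verdict.UNSPECIFIED)


def skew(policy: dict, newly_illegal: set) -> dict:
    """Build P' from P: legal behaviors are untouched, chosen S3 behaviors become illegal."""
    skewed = dict(policy)
    for behavior in newly_illegal:
        if classify(policy, behavior) is Verdict.LEGAL:
            raise ValueError(f"skewing must not outlaw legal behavior: {behavior}")
        skewed[behavior] = Verdict.ILLEGAL
    return skewed


# Hypothetical policy P and its skewed version P'.
P = {
    "open_user_document": Verdict.LEGAL,
    "overwrite_boot_sector": Verdict.ILLEGAL,
}
P_prime = skew(P, {"mail_to_decoy_address"})

print(classify(P, "mail_to_decoy_address").value)        # S3
print(classify(P_prime, "mail_to_decoy_address").value)  # S2
print(classify(P_prime, "open_user_document").value)     # S1
```

A detector written against P' can flag the decoy behavior, whereas under P the same behavior would have passed unnoticed.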

In this paper, we illustrate an application of the PAIDS paradigm by BE- 
SIDES, a tool for detecting massive mailing viruses. We have applied behavior 
skewing and cordoning techniques to the NT-based Windows operating systems 
(Windows 2000 and Windows XP). Inasmuch as behavior skewing and cordoning 
are general techniques, we specialize them to certain types of system behaviors 
for efficient implementation. Specifically, we use behavior skewing to customize 
the security policy upon the use of certain information items on the protected 
system. An information item can be any logical entity that carries information, 
e.g., a filename, an email address, or a binary file. Behavior skewing is accom- 
plished by customizing the access control mechanism that governs the access to 
the information items. In a similar vein, cordoning is applied to critical system 
resources whose integrity must be ensured to maintain system operability. Specif- 
ically, we provide mechanisms for dynamically isolating the interactions between 
a malicious process and a system resource so that any future interaction between 
them will not affect other legal processes’ interaction with the resource. This is 
achieved by replacing the original resource with a virtual resource the first time 
a process accesses a system resource. The cordoning mechanism guarantees for 
each legal process that its updates to critical system resources are eventually 
committed to the actual system resources. If a malicious executable is detected, 
an execution environment (a cordon) consisting of virtual system resources is 
dynamically created for the malicious executable and its victims. Their previous 
updates to the actual system resources can be undone by performing recovery 
operations on those resources, while their future activities can be monitored and 
audited. Depending on the nature of the system resources, cordoning can be 
achieved through methods such as cordoning-in-time and cordoning-in-space. 

The rest of this paper is organized as follows: Section 2 is a brief discussion 
of related works. Section 3 gives the details of how behavior skewing and cor- 
doning work. Section 4 discusses the implementation of BESIDES, a prototype 
system we have implemented for detecting massive-mailing viruses. Section 5 
presents the experiments we have performed with BESIDES and the analysis of 
the experimental results. Section 6 discusses some future directions.

2 Related Work

Virus detection is closely related to the more general topic of malicious exe- 
cutable detection [2,3]. Traditional malicious executable detection solutions use 
signature-based methods, where signatures from known malicious executables 
are used to recognize attacks from them [4]. Security products such as virus 
scanners are examples of such applications. One of the focuses in signature-based 
methods research is the automatic generation of high quality signatures. Along 
this line of work, Kephart and Arnold developed a statistical method to extract 
virus signatures automatically [5] . Heuristic classifiers capable of generating sig- 
natures from a group of known viruses were studied in [6] . Recently, Schultz et 
al. examined how data mining methods can be applied to signature generation 
[7] and built a binary filter that can be integrated with email servers [8]. Al- 
though proved to be highly effective in detecting known malicious executables, 
signature-based methods are unable to detect unknown malicious executables. 
PAIDS explores the possibility of addressing the latter with a different set of
methods, the proactive methods. In PAIDS, signatures of malicious behaviors
are implicitly generated during the behavior skewing stage and are later
used in the behavior monitoring stage to detect malicious executables that
perform such behaviors.

Static analysis techniques that verify programs for compliance with security 
properties have also been proposed for malicious executable detection. Some of 
them focus on the detection of suspicious symptoms: Bishop and Dilger showed
how file access race conditions can be detected dynamically [9]. Tesauro et al. 
used neural networks to detect boot sector viruses [10]. Another approach is to 
verify whether safe programming practices are followed. Lo et al. proposed to use 
“tell-tale signs” and “program slicing” for detecting malicious code [11]. These 
approaches are mainly used as preventive mechanisms, whereas the approach used in
PAIDS focuses more on the detection and tolerance of malicious executables.

Dynamic monitoring techniques such as sandboxes represent another ap- 
proach that contributes to malicious executable detection. They essentially im- 
plement alternative reference monitoring mechanisms that observe software ex- 
ecution and enforce additional security policies when they detect violations [12]. 
The range of security policies that are enforceable through monitoring was
studied in [13-15], and a more general discussion on the use of security policies can be
found in [1]. The behavior monitor mechanism used in PAIDS adopts a similar
approach. Sandboxing is a general mechanism that enforces a security policy by 
executing processes in virtual environments (e.g., Tron [16], Janus [17], Consh 
[18], Mapbox [19], SubDomain [20], and Systrace [21]). Cordoning is similar to 
a light-weight sandboxing mechanism whose coverage and time parameters are 
customizable. However, cordoning emphasizes the virtualization of individual 
system resources, while traditional sandboxing mechanisms focus on the virtu- 
alization of the entire execution environment (e.g., memory space) of individual 
processes. More importantly, sandboxing usually provides little tolerance toward 
intrusions, while cordoning can tolerate misuse of critical system resources as ex- 
plained in Section 3.2. 

Deception tools have long been used as an effective way of defeating malicious
intruders. Among them, a Honeypot (and later a Honeynet) is a vulnerable
system (or network) intentionally deployed to attract intruders’ attention. They
are useful for studying the intruders’ techniques and for assessing the efficacy of 
system security settings [22,23]. Traditional Honeypots are dedicated systems 
that are configured the same way as (or less secure than, depending on how 
they are used) production systems so that the intruders have no direct way to 
tell the difference between the two. In the context of the Honeypot, no modi- 
fication is ever performed on production systems. The latest advances such as 
virtual Honeypots [24, 25] that simulate physical Honeypots at the network level 
still remain the same in this regard. Recently, the concept of Honeypots was 
generalized to Honeytokens - “an information system resource whose value lies 
in unauthorized or illicit use of that resource” [26]. So far, few implementation
or experimental results have been reported (among them, a Linux patch that
implements Honeytoken files was made available in early 2004 [27]). The Honeytoken
concept comes the closest to PAIDS. However, the proactive methods that
PAIDS explores, such as behavior skewing and cordoning, are more comprehensive
and systematic than Honeytokens. We note that the implementation of
our IDS tool BESIDES was well on its way when the Honeytoken concept first 
appeared in 2003. 

System behavior modification has attracted growing interest recently.
Somayaji and Forrest applied “benign responses” to abnormal system call
sequences [28]. The intrusion prevention tool LaBrea is able to trap known
intruders by delaying their communication attempts [29]. The virus throttles built by
Williamson et al. [30-32] utilized the temporal locality found in normal email
traffic and were able to slow down and identify massive mailing viruses as they
made massive connection attempts. Their success in both intrusion prevention
and toleration confirms the effectiveness of behavior-modification-based
methods. However, the modifications performed in all these methods do not attempt
to modify possibly legal behaviors since that may incur false positives, while the
behavior skewing method in PAIDS takes a further step and converts some of
the legal but irrelevant behaviors into illegal behaviors. No false positives are
induced in this context.

Intrusion detection is a research area that has a long history [33,34]. Traditionally,
IDSs have been classified into the following categories. Misuse-based IDSs
look for signatures of known intrusions, such as a known operation sequence
that allows unauthorized users to acquire administrator privileges [35,36]. This
is very similar to the signature-based methods discussed earlier in this section.
Anomaly-based IDSs detect intrusions by identifying deviations from normal
profiles in system and network activities. These normal profiles can be derived
from historical auditing data that are collected during legitimate system
executions [34, 37-39]. Specification-based IDSs look for signatures of known
legitimate activities that are derived from system specifications [40]. These IDSs
differ from misuse-based IDSs in that activities that do not match any signature
are treated as intrusions in the former, but are regarded as legitimate ones
in the latter. Hybrid IDSs integrate different types of IDSs as subsystems [41].
The limitations of the individual subsystems can thus be compensated for and a
better overall quality can be achieved.

Many techniques devised in IDSs are applicable to malicious executable de- 
tection and vice versa. For example, in an effort to apply the specification-based 
approach to malicious executable detection, Giffin et al. showed how malicious 
manipulations that originated from mobile code can be detected [42]. Christodorescu
and Jha proposed an architecture that detects malicious binary executables
even in the presence of obfuscations [43]. One common goal of all IDSs (anomaly- 
based IDSs in particular) is to generate a profile using either explicit or implicit 
properties of the system that can effectively differentiate intrusive behaviors 
from normal behaviors. A wide variety of system properties, ranging from low- 
level networking traffic statistics and system call characteristics to high-level 
web access patterns and system resource usage, have been studied in the literature.
McHugh and Gates reported a rich set of localities observed in web traffic
and pointed out that such localities can be used to distinguish abnormal behaviors
from normal behaviors [44]. None of them considers the possibility of modifying
system security settings for intrusion detection purposes, which is explored in 
PAIDS. The proactive methods proposed in PAIDS modify the definition of nor- 
mal behaviors through skewing the security policy in anticipation of the likely 
behavior of intruders. By doing so, PAIDS implicitly generates profiles that have 
small false positive rates, are easy to understand and configure, and are highly 
extensive. 

3 Methodology 

3.1 Behavior Skewing 

As illustrated in Fig. 1, behaviors are high-level abstractions of system activities
that specify their intentions and effects, but ignore many of the implementation 
details of the actual operations they perform. We refer to the part of security 
assignments of behaviors that are explicitly specified to be legal (or illegal) as 
the legal behavior set (LBS) (or the illegal behavior set (IBS)). The security 
assignments of behaviors that are not explicitly specified either intentionally or 
unintentionally are denoted as the unspecified behavior set (UBS). The UBS 
consists of behaviors that either 1) the user considers to be irrelevant to her 
system’s security or 2) the user is unaware of and unintentionally fails to specify 
their security assignments; the default security policy will apply to the behaviors 
in the set UBS. Behavior skewing refers to the manipulation of the security policy 
that customizes the security assignments of the behaviors in the set UBS that 
are implicitly assigned to be legal by default. The goal of behavior skewing is to 
create enough differences in the legal behaviors among homogeneous systems so 
that malicious executables are made prone to detection. 

Specifically, behavior skewing customizes a security policy regarding the use
of certain information items in a system. Behavior skewing creates its own access
control mechanism that describes the use of information items since many
of the information items are not system resources and are thus not protected
by native security mechanisms. The customization is performed on information
domains, which are sets of information items that are syntactically equivalent
but semantically different. For example, all text files in a target system form
an information domain of text files. In particular, behavior skewing reduces the
access rights to existing items that are specified by the default access rights and
creates new information items in an information domain with reduced access
rights.

Fig. 1. System States and Behavior Skewing (figure label: “The System Behavior Space”)

Figure 1 shows two possible behavior skewing instances. We emphasize that 
although different skewing mechanisms produce different security settings, they 
all preserve the same LBS. The modified security policy generated by a behavior 
skewing is called the skewed security policy (or skewed policy for short). The
default security policy is transparent to the user and cannot be relied upon 
by the user to specify her intentions. Otherwise, additional conflict resolution 
mechanisms may be required. 

After behavior skewing is completed, the usage of information items is moni- 
tored by a behavior monitoring mechanism that detects violations of the skewed 
policy. Any violation of the skewed policy triggers an intrusion alert. Note that
the monitoring mechanism does not enforce the skewed policy;
instead, it simply reports violations of the skewed policy to a higher-level entity
(e.g., the user or an IDS) that is responsible for handling the reported violations. 
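
The relationship between the LBS, IBS, UBS, and a skewed policy can be illustrated with a minimal sketch. It is not the BESIDES implementation: the class and function names, and the modeling of a behavior as an (action, item) tuple, are assumptions made purely for illustration. The sketch shows that skewing only reassigns behaviors that fall under the default policy, never touches the LBS, and that the monitor reports violations rather than blocking them.

```python
from enum import Enum


class Assignment(Enum):
    LEGAL = "legal"
    ILLEGAL = "illegal"


class SkewedPolicy:
    """Security assignments over behaviors, modeled here as hashable tuples."""

    def __init__(self, lbs, ibs=(), default=Assignment.LEGAL):
        self.lbs = set(lbs)      # explicitly legal behaviors (LBS)
        self.ibs = set(ibs)      # explicitly illegal behaviors (IBS)
        self.default = default   # default policy, applied to the UBS
        self.skewed = set()      # UBS behaviors made illegal by skewing

    def skew(self, behavior):
        """Make an unspecified behavior illegal; the LBS must be preserved."""
        if behavior in self.lbs:
            raise ValueError("behavior skewing must not change the LBS")
        self.skewed.add(behavior)

    def assignment(self, behavior):
        if behavior in self.lbs:
            return Assignment.LEGAL
        if behavior in self.ibs or behavior in self.skewed:
            return Assignment.ILLEGAL
        return self.default


def monitor(policy, observed):
    """Return (report) every observed behavior that violates the policy."""
    return [b for b in observed if policy.assignment(b) is Assignment.ILLEGAL]


# Hypothetical example: one explicitly legal recipient and one bait address.
policy = SkewedPolicy(lbs={("send", "bob@example.org")})
policy.skew(("send", "bait@example.net"))
alerts = monitor(policy, [("send", "bob@example.org"),
                          ("send", "bait@example.net")])
```

Here `alerts` contains only the behavior that uses the bait; the explicitly legal behavior and any other unspecified behavior remain legal under the default policy.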



3.2 Cordoning 

Although behavior skewing and monitoring make malicious executables more 
prone to detection, actual detection cannot happen before the malicious exe- 
cutables have their chance to misbehave. Hence, there is a need for additional 
protection mechanisms to cope with any damage the malicious executables may
cause before they are eventually detected. Among them, the recovery of system
states is an obvious concern. In this section, we illustrate another proactive
method, cordoning, that can be used to recover states of selected system re- 
sources. Existing system recovery solutions, such as restoring from backup media 
or from revertible file systems, usually perform bulk recovery operations, where 
the latest updates to individual system resources containing the most recent 
work of a user may get lost after the recovery. Cordoning addresses this problem 
by performing the recovery operation individually. 

In general, cordoning is a mechanism that allows dynamic partial virtual- 
ization of execution environments for processes in a system. It is performed on 
critical system resources (CSRs), objects in the system whose safety is deemed
critical to the system’s integrity and availability (e.g., executables, network
services, and data files). The cordoning mechanism converts a CSR (also called
an actual CSR) to a recoverable CSR (or a cordoned CSR) that can be recovered 
to a known safe state by dynamically creating a virtual CSR (called the current 
CSR), and granting it to processes that intend to interact with the actual CSR.
The current CSR provides the same interface as the actual CSR. The underly- 
ing cordoning mechanism ensures all updates to the current CSR are eventually 
applied to the actual CSR during normal system execution. 

When a malicious executable is detected by the behavior monitor, its updates
to all cordoned CSRs can be undone by performing the corresponding recovery
operations on the cordoned CSRs, which restore the actual CSRs to known secure
states (called recovered CSRs). The malicious executable and its victims - its
children, as well as processes accessing system resources that have been updated
by the malicious executable^ - continue to interact with the current CSRs, the
same set of CSRs they are using at the time of detection. However, their future
updates to this set of current CSRs will not be applied to the actual CSRs and
are instead subject to auditing and other intrusion investigation mechanisms.
Unaffected processes - existing processes other than the malicious executable and 
its victims, and all newly started processes - will use the recovered CSRs in their 
future interactions. The malicious executable and its victims are thus cordoned 
by a dynamically created execution environment (called a cordon) that consists 
of these current CSRs. The operations performed by the malicious executable 
and its victims on CSRs are thus isolated from those performed by the unaffected 
processes. 
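
The victim criterion described above (children of the malicious executable, plus processes that accessed resources it updated) can be sketched as follows. The function name and the dictionary-based bookkeeping are hypothetical, and the sketch assumes that "children" extends to all descendants, which the text does not state explicitly.

```python
def find_victims(malicious, children, updated_by, accessed_by):
    """Approximate the victim set of a detected malicious process.

    malicious   -- process id of the detected executable
    children    -- dict: pid -> list of child pids (process hierarchy)
    updated_by  -- dict: CSR name -> set of pids that updated the CSR
    accessed_by -- dict: CSR name -> set of pids that accessed the CSR
    """
    victims = set()

    # 1) Descendants of the malicious process (assumption: all generations).
    stack = list(children.get(malicious, []))
    while stack:
        pid = stack.pop()
        if pid not in victims:
            victims.add(pid)
            stack.extend(children.get(pid, []))

    # 2) Processes that accessed a CSR the malicious process has updated.
    for csr, writers in updated_by.items():
        if malicious in writers:
            victims |= accessed_by.get(csr, set()) - {malicious}

    return victims


def csr_view(pid, malicious, victims):
    """Cordoned processes keep the current CSRs; all others get recovered CSRs."""
    return "current" if pid == malicious or pid in victims else "recovered"
```

This is only the simple approximation the text describes; it does not attempt to track indirect information flows between processes.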

^ We note that an accurate identification of victims can be as difficult as the identification
of covert channels. The simple criterion we use here is a reasonable approximation,
although more accurate algorithms can be devised.

Fig. 2. A Cordoning Example

CSRs can be classified as revertible CSRs, delayable CSRs, and substitutable
CSRs based on their nature. Two cordoning mechanisms, cordoning-in-time and
cordoning-in-space, can be applied to these CSRs as described below:

A revertible CSR is a CSR whose updates can be revoked. For example, a
file in a journaling file system can be restored to a previous secure state if it is
found corrupted during a virus infection. Cordoning of a revertible CSR can be
achieved by generating a revocation list consisting of revocation operations of all
committed updates. The recovery of a revertible resource can be performed by
carrying out the revocation operations in the revocation list that lead to a secure
state of the CSR.
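
As a rough illustration of a revocation list, the sketch below models a revertible CSR as a key-value store. The class name, the `mark_secure` checkpoint, and the dictionary representation are assumptions made for this example only; a real revertible resource (e.g., a journaling file system) would record its own kind of undo operations.

```python
class RevertibleCSR:
    """A key-value resource whose committed updates can be revoked."""

    def __init__(self, state):
        self.state = dict(state)
        # One revocation record per committed update: (key, old value, existed?)
        self.revocations = []

    def update(self, key, value):
        """Commit an update and remember how to undo it."""
        self.revocations.append((key, self.state.get(key), key in self.state))
        self.state[key] = value

    def mark_secure(self):
        """Declare the current state secure; older records are no longer needed."""
        self.revocations.clear()

    def recover(self):
        """Undo updates in reverse order until the last secure state is reached."""
        while self.revocations:
            key, old_value, existed = self.revocations.pop()
            if existed:
                self.state[key] = old_value
            else:
                self.state.pop(key, None)


csr = RevertibleCSR({"hosts": "127.0.0.1 localhost"})
csr.mark_secure()
csr.update("hosts", "10.0.0.66 update.example.org")  # suspicious overwrite
csr.update("dropper", "payload")                      # suspicious new entry
csr.recover()                                         # back to the secure state
```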

Cordoning-in-time buffers operational requests on a CSR and delays their
commitment until a secure state of the resource can be reached. It is thus
also referred to as delayed commitment. Cordoning-in-time can be applied to
delayable CSRs - CSRs that can tolerate certain delays when being accessed.
For example, the network resource that serves a network transaction is delayable
if the transaction is not a real-time transaction (e.g., an SMTP server). The delays
on the requests can be arbitrarily long, provided they do not exceed a
transaction-specific time-out value; time-outs constrain the maximum length of delays.
Such a constraint, as well as others (e.g., constraints due to resource availability),
may render a delayable CSR a partially recoverable CSR; the latter can only be
recovered within a limited time interval. The recovery of a delayable resource
can be performed by discarding buffered requests up to a secure state. If no such 
secure state can be found after exhausting all buffered requests, the CSR may 
not be securely recovered unless it is also revertible. 
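
Delayed commitment can be sketched with a small buffer that holds requests until a time-out forces them to be committed. The class and method names are hypothetical, and time is passed in explicitly (rather than read from a clock) so that the behavior is easy to follow.

```python
class DelayedCommit:
    """Buffer requests to a delayable CSR; commit only when the time-out expires."""

    def __init__(self, commit, timeout):
        self.commit = commit    # callable that applies a request to the actual CSR
        self.timeout = timeout  # transaction-specific bound on the delay
        self.buffer = []        # pending (arrival time, request) pairs

    def submit(self, request, now):
        self.buffer.append((now, request))

    def tick(self, now):
        """Commit every request that has been delayed for the full time-out."""
        ready = [r for t, r in self.buffer if now - t >= self.timeout]
        self.buffer = [(t, r) for t, r in self.buffer if now - t < self.timeout]
        for request in ready:
            self.commit(request)
        return ready

    def recover(self):
        """Discard all buffered requests and return them for auditing."""
        dropped = [r for _, r in self.buffer]
        self.buffer.clear()
        return dropped


committed = []
cordon = DelayedCommit(committed.append, timeout=60)
cordon.submit("message 1", now=0)
cordon.submit("message 2", now=50)
cordon.tick(now=60)          # "message 1" reaches its time-out and is committed
dropped = cordon.recover()   # "message 2" is still buffered and can be discarded
```

The example also shows why a delayable CSR is only partially recoverable: once "message 1" has been committed, discarding buffered requests no longer undoes it.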

Cordoning-in-space is applied to a substitutable CSR - a CSR that can be 
replaced by another CSR (called its substitute) of the same type transparently. 
For example, a file is a substitutable CSR because any operation performed 
on a file can be redirected to a copy of that file. Cordoning-in-space redirects 
operational requests from a process toward a substitutable CSR to its substitute. 
The actual CSR is kept in secure states during the system execution and is 
updated by copying the content of its substitute only when the latter is in secure 
states. A substitutable CSR can be recovered by replacing it with a copy of the 
actual CSR saved when it is in a secure state. Where multiple substitutes exist for 
a CSR (e.g., multiple writer processes), further conflict resolution is required but
will not be covered in this paper. We note here that the cordoning and recovery
of CSRs are independent mechanisms that can be separately performed. 
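
Cordoning-in-space can be sketched in the same spirit. In the example below, in-memory strings stand in for file contents, and the class and method names are hypothetical. It covers only the single-substitute case; the conflict resolution needed for multiple substitutes is not addressed.

```python
class SubstitutableCSR:
    """Redirect all accesses to a substitute; keep the actual CSR in a secure state."""

    def __init__(self, content):
        self.actual = content      # only ever updated from a secure substitute
        self.substitute = content  # what processes really read and write

    def read(self):
        return self.substitute

    def write(self, content):
        self.substitute = content  # the request is redirected to the substitute

    def commit(self):
        """Copy the substitute back; call only when it is known to be secure."""
        self.actual = self.substitute

    def recover(self):
        """Create a recovered CSR from the actual CSR for unaffected processes.

        The cordoned processes keep using this object (the current CSR).
        """
        return SubstitutableCSR(self.actual)


current = SubstitutableCSR("v1")
current.write("v2")
current.commit()                # "v2" is considered secure and is committed
current.write("infected")       # later update by a malicious process
recovered = current.recover()   # unaffected processes see "v2" again
```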

Figure 2 shows an example of cordoning on two CSRs: Resource #2 and 
Resource #4. Process B and Process C are identified as victims of a virus in- 
fection. They are placed in a cordon consisting of the Current Resource #2 and 
the Current Resource #4. The cordoning mechanism performs the recovery op- 
erations, where the Recovered Resource #2 and the Recovered Resource #4 are 
created. Unaffected processes that share CSRs with the victims, e.g., Process
A, continue to use Recovered Resource #2. New processes such as Process D will
be given the Recovered Resource #4 when they request access to Resource #4 in
the future. The remaining system resources, such as Resource #1, #3, and #5, 
are not cordoned. 

4 Implementation 

BESIDES is a prototype system that uses behavior skewing, behavior monitor- 
ing, and cordoning to detect unknown massive mailing viruses. As illustrated 
in Figure 3, BESIDES consists of an email address domain skewer (EADS), an 
email address usage monitor (EAUM) and a SMTP Server Cordoner (SSC). The 
EADS is a user-mode application and the EAUM and the SSC are encapsulated 
in a kernel-mode driver. BESIDES is implemented on Windows 2000 and also 
runs on Windows XP. The main reason we chose to build a Windows-based
system is that Windows is one of the most attacked systems and virus attacks
in particular are frequently seen there. Developing and experimenting with BESIDES
on Windows allows us to illustrate our approach and to show that it can be 
implemented on commercial operating systems. 

4.1 The Email Address Domain Skewer 

After BESIDES is installed, the user needs to use the EADS to skew the behavior 
of the email address domain, an information domain that is actively exploited by 
malicious executables such as massive mailing viruses. The skewing is performed
by modifying the usage policy of email addresses. A typical Windows system 
does not restrict the use of email addresses. In particular, all email addresses 
can be used as email senders or recipients. The EADS changes this default usage 
policy by making certain email addresses unusable in any locally composed email. 
The affected email addresses are referred to as baiting addresses, or simply baits. 
The existing email addresses are not skewed unless the user is able to determine
which of them are skewable. By default, the EADS grants all subjects full usage
(i.e., “allow all”) of any email address except the baits. It deems any use of
baits a violation of the skewed email address usage policy (i.e., “deny all”). The
EAUM will issue an intrusion alert whenever it detects the use of a bait in any
email message sent locally. In addition, the EADS sets access rights to certain
email domains (e.g., foo.com in alice@foo.com) in baits to “deny all” as well. 
This makes the skewing more versatile so that even viruses manipulating email 
addresses before using them (e.g., Bugbear) can also be detected. 

The skewing of email address domain requires the creation of enough unpre- 
dictability in the domain so that viruses are not likely to figure out whether an 
email address is legitimate by simply looking at the email address itself. The 
EADS uses both heuristic methods and randomization methods in its skewing 
procedure, i.e., it generates the baiting addresses either based on a set of baits
the user specifies or at random.

Specifically, the EADS creates email baits in the following file types: email boxes,
e.g., .eml files; text-based files, e.g., .HTM, .TXT, and .ASP files; and binary
files, e.g., .MP3, .JPG, and .PDF files. Email boxes are skewed by importing
baiting email messages that use baiting addresses as senders. Text-based files 
are skewed with newly created syntactically valid files that contain baits. Binary 
files are skewed with the same text files as in text-based file skewing but with 
modified extensions. Such baiting binary files have invalid format and cannot be 
processed by legitimate applications that operate on these files. This does not 
pose a problem because these baits are not supposed to be used by them in the 
first place. Many massive mailing viruses, however, are still able to discover the 
baits within these files because they usually access them by “brute-force” and 
neglect their formats, e.g., by searching for email addresses with straightforward 
string matching algorithms. By default, all these baits are placed in commonly 
accessed system directories such as “C:\”, and “My Documents”. The user is 
allowed to pick additional directories she prefers. 
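
The bait creation step can be illustrated with a short sketch. It is not the EADS code: the function names, the file name `contacts`, the placeholder domains, and the chosen extensions are all assumptions. It generates baits from user-supplied seeds plus random user names, then writes them into a text-based file and into a "binary" file that is simply the same text under a different extension, as described above.

```python
import random
import string
from pathlib import Path


def make_baits(count, domains, seeds=(), rng=None):
    """Return bait addresses: user-specified seeds first, then random ones."""
    rng = rng or random.Random()
    baits = list(seeds)
    while len(baits) < count:
        length = rng.randint(5, 10)
        user = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        baits.append(f"{user}@{rng.choice(domains)}")
    return baits


def bait_html(baits):
    """A syntactically valid HTML page whose mailto links carry the baits."""
    links = "\n".join(f'<a href="mailto:{b}">{b}</a>' for b in baits)
    return f"<html><body>\n{links}\n</body></html>\n"


def plant_baits(directory, baits, extensions=(".htm", ".txt", ".jpg")):
    """Write one bait file per extension into the given directory."""
    written = []
    for ext in extensions:
        path = Path(directory) / f"contacts{ext}"
        # Non-HTML files (including fake binaries) just hold the plain text.
        body = bait_html(baits) if ext == ".htm" else "\n".join(baits) + "\n"
        path.write_text(body, encoding="ascii")
        written.append(path)
    return written
```

A caller would pass the directories named in the text (for example the user's document folder) as `directory`; the choice of bait domains determines how hard it is for a virus to tell baits from legitimate addresses.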

4.2 The Email Address Usage Monitor 

BESIDES uses system call interposition to implement the EAUM (and the SSC 
as well). System call interposition is a general technique that allows intercep- 
tion of system call invocations and has been widely used in many places [45, 46, 
20,47,19,17,48]. Windows 2000 has two sets of system calls [49]: The Win32 
application programming interfaces (APIs), the standard API for Windows ap- 
plications, are implemented as user-mode dynamically linked libraries (DLLs). 
The native APIs (or native system calls) are implemented in kernel and are
exported to user-mode modules through dummy user-mode function thunks in
ntdll.dll. User-mode system DLLs use native APIs to implement Win32 APIs.
A Win32 API is either implemented within user-mode DLLs or mapped to one 
(or a series of) native API(s). Windows 2000 implements the TCP/IP proto- 
col stack inside the kernel [50]. At the native system call interface, application 
level protocol behaviors are directly observable and can thus be efficiently mon- 
itored. Transport level protocol specific data (e.g., TCP/UDP headers) are not 
present at this interface. This saves the additional time needed to parse them,
analyze them, and then reconstruct protocol data flow states, a cost that is unavoidable
in typical network-based interceptors.

The EAUM runs in kernel mode and monitors the use of email addresses in 
SMTP sessions. An SMTP session consists of all SMTP traffic between a process 
and an SMTP server. It starts with a “HELO” (or “EHLO”) command and usually
ends with a “QUIT” command. The EAUM registers a set of system call filters to 
the system call interposition module (also referred to as the BESIDES engine) 
during BESIDES initialization when the system boots up. It uses an SMTP automaton
derived from the SMTP protocol [51] to simulate the progress of both
the local SMTP client and the remote SMTP server. The SMTP automaton in
BESIDES is shared among all SMTP sessions. The BESIDES engine intercepts 
native system calls that transmit network data and invokes the system call fil- 
ters registered by the EAUM. These filters extract SMTP data from network 
traffic, parse them to generate SMTP tokens, and then perform state transi- 
tions in the SMTP automaton that monitors the corresponding SMTP session. 
Each SMTP session is represented in the EAUM by a SMTP context, which is 
passed to the SMTP automaton as a parameter each time that particular SMTP 
session makes progress. The use of email addresses in a SMTP session can be 
monitored when the SMTP automaton enters corresponding states. Specifically, 
the EAUM looks for SMTP commands that explicitly use email addresses (i.e.,
“MAIL FROM:” and “RCPT TO:”) and validates these uses against the skewed
email address usage policy specified by the EADS. If any violation is detected,
the EAUM notifies the SSC with an intrusion alert because none of the baiting 
addresses should be used as either a recipient or sender. The use of legitimate 
email addresses does not trigger any alert because their usage is allowed by the 
EADS. One advantage of this monitoring approach is that viruses that carry 
their own SMTP clients are subject to detection, while interpositions at higher 
levels (e.g., Win32 API wrappers) are bypassable. Misuse detection mechanisms 
in the form of wrappers around SMTP servers are not used since viruses may 
choose open SMTP relays that are not locally administrated. 
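
The monitoring logic described above can be approximated by a small user-level state machine. This is a simplified sketch rather than the kernel-mode EAUM: the class name and state names are hypothetical, only client commands are tracked, and the per-session SMTP context is folded into the object itself. It checks both whole addresses and domains against the baits, which is what allows recombined addresses to be caught.

```python
import re


class SmtpSessionMonitor:
    """Track one SMTP session and flag bait use in MAIL FROM / RCPT TO."""

    def __init__(self, bait_addresses, bait_domains=()):
        self.bait_addresses = {a.lower() for a in bait_addresses}
        self.bait_domains = {d.lower() for d in bait_domains}
        self.state = "INIT"

    def feed(self, line):
        """Advance on one client line; return an alert string or None."""
        text = line.strip()
        upper = text.upper()
        if self.state == "DATA":          # message body: wait for the final "."
            if text == ".":
                self.state = "READY"
            return None
        if upper.startswith(("HELO", "EHLO")):
            self.state = "READY"
        elif upper.startswith("MAIL FROM:") and self.state == "READY":
            self.state = "MAIL"
            return self._check(text)
        elif upper.startswith("RCPT TO:") and self.state in ("MAIL", "RCPT"):
            self.state = "RCPT"
            return self._check(text)
        elif upper == "DATA" and self.state == "RCPT":
            self.state = "DATA"
        elif upper == "QUIT":
            self.state = "DONE"
        return None

    def _check(self, text):
        arg = text.split(":", 1)[1].strip()
        bracketed = re.match(r"<([^>]*)>", arg)
        if bracketed:
            addr = bracketed.group(1).lower()
        else:
            addr = (arg.split()[0] if arg else "").lower()
        domain = addr.rpartition("@")[2]
        if addr in self.bait_addresses or domain in self.bait_domains:
            return f"bait used: {addr}"
        return None
```

For instance, with the bait `bait1@foo.example` and the bait domain `foo.example`, a session that sends `RCPT TO:<bait1@foo.example>` raises an alert, and so does one that uses a recombined address such as `bob@foo.example`.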

4.3 The SMTP Server Cordoner

In addition to detecting massive mailing viruses, BESIDES also attempts to
protect CSRs (here, SMTP servers) from possible abuse by them. An SMTP
server is a delayable CSR since email is not considered a real-time communication
mechanism and email users can tolerate a certain amount of delay. It
is also weakly revertible (by this we mean the damage of a delivered message
containing a virus can be mitigated by sending a follow-up warning message to
the recipient when the virus is later detected). Whenever an SMTP session is
started by a process, the SSC identifies the SMTP server it requests and assigns
it the corresponding virtual SMTP server (the current SMTP server). Delayed
commitment is then used to buffer the SMTP messages the process sends to the
virtual SMTP server. The SSC also runs in kernel mode and shares the same
SMTP automaton with the EAUM. 

Specifically, the SSC intercepts SMTP commands and buffers them inter- 
nally. It then composes a positive reply and has the BESIDES engine forward 
it to the SMTP client indicating the success of the command. After receiving 
such a reply, the SMTP client will consider the previous command successful
and proceed with the next SMTP command. The SSC essentially creates a virtual
SMTP server for the process to interact with. The maximum time an SMTP
message can be delayed is determined by the cordoning period - a user specified 
time-out value that is smaller than the average user tolerable delays, as well as 
the user specified threshold on the maximum number of delayed messages. A 
SMTP message is delivered (committed) to the actual SMTP server when either 
it is delayed more than the cordoning period or the number of delayed messages 
exceeds the message number threshold. After delivering a SMTP message, the 
SSC creates a corresponding log entry (a revocation record) in the SMTP server 
specific log containing the time, subject, and recipient of the delivered message.
When informed of an intrusion alert, the SSC identifies the process that is
performing the malicious activity as the malicious executable. It then determines
the set of victims based on the CSR access history and process hierarchy, i.e., all
processes that access CSRs updated by this process and all its child processes
are labeled as victims. After this, the SSC initiates the recovery operations on all
cordoned CSRs they have updated. If the process that owns an SMTP session is
one of the victims or the malicious executable itself, no buffered messages from
that SMTP session are committed; instead, they are all quarantined. All messages
that were previously committed are regarded as suspicious, and the SSC sends a
warning message to their recipients as a weak recovery mechanism using the
information saved in the delivery log entries. Since the SMTP messages sent to
a SMTP server are independent, the order they are received does not affect the 
correct operation of the SMTP server [52]. Thus the actual SMTP server can 
be kept in a secure state even if some of the messages are dropped during the 
recovery operation. In the meantime, the unaffected processes are unaware of
this recovery and can proceed as if no intrusion has occurred. 
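
The buffering, commitment, and quarantine rules of the SSC can be summarized in a short sketch. It is an illustration under several assumptions rather than the actual driver code: the class and method names are hypothetical, messages are plain dictionaries with `subject` and `to` keys, the oldest message is the one committed when the message number threshold is exceeded, and each log entry additionally records the sending process id so that it can be matched against the cordoned processes.

```python
class SmtpCordoner:
    """Delay outgoing SMTP messages; quarantine them if their sender is cordoned."""

    def __init__(self, deliver, period=60.0, max_delayed=10):
        self.deliver = deliver          # callable committing to the actual server
        self.period = period            # cordoning period (seconds)
        self.max_delayed = max_delayed  # threshold on the number of delayed messages
        self.delayed = []               # (queued at, pid, message), oldest first
        self.log = []                   # revocation records of delivered messages
        self.quarantine = []

    def accept(self, pid, message, now):
        """Buffer the message and answer as the virtual SMTP server would."""
        self.delayed.append((now, pid, message))
        self.flush(now)
        return "250 OK"

    def flush(self, now):
        """Commit messages that exceeded the period or overflow the threshold."""
        while self.delayed and (now - self.delayed[0][0] > self.period
                                or len(self.delayed) > self.max_delayed):
            _, pid, message = self.delayed.pop(0)
            self.deliver(message)
            self.log.append({"time": now, "pid": pid,
                             "subject": message["subject"],
                             "recipient": message["to"]})

    def on_alert(self, cordoned_pids):
        """Quarantine buffered messages; return log entries that need warnings."""
        self.quarantine += [m for _, p, m in self.delayed if p in cordoned_pids]
        self.delayed = [e for e in self.delayed if e[1] not in cordoned_pids]
        return [entry for entry in self.log if entry["pid"] in cordoned_pids]
```

The entries returned by `on_alert` correspond to messages that were committed before detection; they are the ones for which a follow-up warning would be sent to the recipients.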



5 Experimental Results 

We performed a series of experiments on BESIDES with viruses we collected
in the wild. These experiments were performed on a closed local area network
(LAN) consisting of two machines: a server and a client. BESIDES is installed
on the client and its EADS is set up the same way in all experiments. The server
simulates a typical network environment to the client by providing essential network
services, such as DNS, routing, SMTP, POP3, and Remote Access Service
(RAS). The server is also equipped with anti-virus software (i.e., Symantec Antivirus)
and network forensic tools (e.g., Ethereal and TcpDump). Evidence of
virus propagation can be gathered from the output of these tools as well as service
logs.

Table 1. Effectiveness of BESIDES (The BESIDES SSC is configured to intercept at
most 10 total SMTP messages and for at most 60 seconds for each SMTP message
during these experiments. The Outlook Express address book is manually skewed)

Virus                                        BugBear     Haptime  Klez   MyDoom
Client  Detected?                            Yes         Yes      Yes    Yes
        Baits Used at Detection              Addr. Book  .htm     .html  .htm
        Delayed Message Quarantined?         Yes         Yes      Yes    Yes
Server  Detected? (by anti-virus software)   Yes         No       Yes    No
        SMTP Message Received?               Yes         No       No     No

Two sets of experimental results are presented in the remainder of this section.
First we present the outcome when BESIDES is tested against several
actual viruses. These results demonstrate the effectiveness of behavior skewing
and cordoning when they are applied in a real-world setting. The second set
of results presents the performance overheads observed during normal system
execution for normal system applications, including delays observed at the native
system call interfaces, and those at the command-line interface. As we
expected, the overheads are within a reasonable range even though we have not
performed any optimization in BESIDES.

5.1 Effectiveness Experiments 

The results of our experiments with four actual viruses - BugBear [53], Haptime 
[54], Klez [55], and MyDoom [56] - are shown in Table 1. In all the experiments,
BESIDES was able to detect the viruses being tested. Although
all these viruses attempted to collect email addresses on the client machine, their 
methods were different and the actual baiting addresses they were using when 
BESIDES detected them also differed from each other. In all experiments, the 
BESIDES SSC intercepted multiple SMTP messages sent by viruses and suc- 
cessfully quarantined them. However, some of the virus carrying messages were 
found delivered before the virus was detected during the experiment with Bug- 
bear. From the email messages received by the SMTP server from Bugbear, we 
found Bugbear actually manipulated either the sender or the recipient addresses 
before sending them. Specifically, it forms a new email address by combining the 
user field of one email address and the domain field of another email address. 
BESIDES was initially unable to detect these messages since it only considers 
the matches of both the user name field and the domain field as an acceptable 
match to a bait. It then committed these messages to the SMTP server^. With 

^ This loophole is later fixed by creating additional email usage policy on email do- 
mains and creating baiting domains in the EADS. See Section 4.1 for details. 
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our experimental setups, an average two out of ten such messages were found 
committed to the SMTP server. We note that the actual numbers varies with 
the skewing manipulations. In one of our experiments, two such messages were 
thus committed before BESIDES detected the virus. The anti-virus software on 
the server detected Bugbear and Klez because they also spread over network 
shares, a mechanism that is not cordoned by BESIDES in all these experiments. 
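
The matching issue described above can be illustrated with a minimal Python
sketch. It is not the BESIDES implementation: the function names, the bait
addresses, and the bait_domains parameter are hypothetical and only model the
behavior stated in the text (full user-and-domain matching, and the later fix
based on baiting domains).

```python
def split_address(address):
    """Split 'user@domain' into a lowercase (user, domain) pair."""
    user, _, domain = address.strip().lower().rpartition("@")
    return user, domain


def matches_bait(address, bait_addresses, bait_domains=frozenset()):
    """Return True if the address hits a bait.

    A full match requires both the user field and the domain field to equal
    those of one bait address. bait_domains is an assumption that models the
    later fix: any address inside a baiting domain also counts as a hit.
    """
    user, domain = split_address(address)
    full_matches = {split_address(bait) for bait in bait_addresses}
    return (user, domain) in full_matches or domain in bait_domains


# Hypothetical bait addresses planted on the client.
baits = ["alice.bait@bait-one.example", "bob.bait@bait-two.example"]

# Recombination as described for BugBear: the user field of one address is
# joined with the domain field of another.
recombined = "alice.bait@bait-two.example"

print(matches_bait(recombined, baits))                        # full match only
print(matches_bait(recombined, baits, {"bait-two.example"}))  # with bait domains
```

Under full matching the recombined address slips through, whereas adding the
baiting domain catches it, which mirrors the loophole and its fix.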

We observed significant hard disk accesses from Haptime when it tried to
infect local files and collect email addresses from them. All of this happened
before the virus started to perform massive mailing operations. This suggests
that Haptime could be detected faster if BESIDES skewed file access rights as
well.

The outbreak of MyDoom occurred after the version of BESIDES used in the
experiments had been completed. Thus BESIDES had no knowledge of the virus when
it was tested against it. BESIDES successfully detected the virus when it
tried to send messages using baiting email addresses placed in .htm files. This
demonstrates the capability of BESIDES to detect unknown viruses.

As some of the viruses (e.g., Klez) use probabilistic methods, their behavior
can differ between experiments. The result shown here is thus only one possible
behavior. Also, some of the viruses use the local time to decide whether
particular operations are to be performed. For example, MyDoom does not perform
massive mailing if it performs DDoS attacks, which depends on the system time.
We therefore manually changed the client's system time to hasten the activation
of massive mailing by the virus. We emphasize that this is done to speed up our
experiments and that system time manipulation is not needed to effect detection
in production.

5.2 Performance Experiments 

Overall System Overheads. Table 2 shows statistical results of the system
call overheads observed at three native system call interfaces during normal
system execution. Two interceptors, the pre-interceptor and the
post-interceptor, are used to perform interception operations before and after
the actual system call is executed. The three native system calls shown in the
table are representative of the three phases of a cordoning session.

NtCreateFile() is invoked to create or open a file (including a socket) [49].
BESIDES intercepts it so that SMTP sessions can be recognized and their
corresponding SMTP contexts can be created. This is the setup phase of an SMTP
session. The post-interceptor is used to perform these operations. As can be
seen from Table 2, the overhead is a fraction of the actual system call time.

NtDeviceIoControlFile() performs an I/O control operation on a file object
that represents a device [49]. Network packets are also transmitted using this
system call. BESIDES intercepts it so that it can inspect SMTP session control
and data messages. This is the inspection phase of an SMTP session. During the
interception, SMTP data are parsed, SMTP tokens are generated, and the state of
the BESIDES EAUM SMTP automaton is updated. The pre-interceptor and the
post-interceptor process sent data and received data, respectively. The overhead
observed is only a small percentage of the actual system call time because the
actual system call incurs expensive hardware I/O operations.

Table 2. BESIDES system call overheads observed at native system call interface. All
numbers are calculated from CPU performance counter values directly retrieved from
the client machine (Pentium III 730MHz CPU with 128MB memory)

Native System Call                NtClose()   NtCreateFile()   NtDeviceIoControlFile()

Total Execution Time                 283076          1997247                  16110683
Pre-Interceptor Time                 108520            24033                     21748
Actual System Call Time               32960          1471367                  15791127
Post-Interceptor Time                 23588           371826                    156350
System Call Interposition Time       118008           130021                    141457
Overhead (in %)                     758.85%           35.74%                     2.02%

NtClose() is invoked by a user-mode process to close a handle to an object
[49]. BESIDES intercepts it to terminate an SMTP session. This is called the
termination phase of that SMTP session. The pre-interceptor releases the system
resources that are used to monitor this SMTP session. The actual system call is
lightweight and takes much less time than the other two, so the overhead
observed dominates the actual system call time. However, as both the setup and
termination phases are performed only once during an SMTP session's lifespan,
their relatively high cost can be amortized by the much cheaper inspection phase.
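
The overhead percentages in Table 2 are consistent with computing overhead as
(total execution time - actual system call time) / actual system call time.
The short Python sketch below recomputes them from the table and illustrates
the amortization argument. The formula is our reading of the table rather than
a definition given in the text, and the session model (one setup call, a
number of inspection calls, one termination call) is a simplifying assumption.

```python
# Values copied from Table 2 (CPU performance counter values).
TABLE_2 = {
    "NtClose": {"total": 283076, "actual": 32960},
    "NtCreateFile": {"total": 1997247, "actual": 1471367},
    "NtDeviceIoControlFile": {"total": 16110683, "actual": 15791127},
}


def overhead_percent(total, actual):
    """Extra time added by interception, relative to the bare system call."""
    return 100.0 * (total - actual) / actual


def session_overhead_percent(inspections):
    """Overhead of a hypothetical session: 1 setup, n inspections, 1 close."""
    counts = {
        "NtCreateFile": 1,
        "NtDeviceIoControlFile": inspections,
        "NtClose": 1,
    }
    total = sum(TABLE_2[name]["total"] * n for name, n in counts.items())
    actual = sum(TABLE_2[name]["actual"] * n for name, n in counts.items())
    return overhead_percent(total, actual)


for name, row in TABLE_2.items():
    print(name, round(overhead_percent(row["total"], row["actual"]), 2))

# The aggregate overhead shrinks as a session performs more inspection calls.
for n in (1, 10, 100):
    print(n, round(session_overhead_percent(n), 2))
```

As the number of inspection calls grows, the aggregate figure approaches the
per-call overhead of the inspection phase.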

Finally, it should be noted that a different type of overhead, the system
call interposition overhead, exists in all system call interceptions. This
overhead is the mandatory cost that each intercepted system call has to pay,
including extra system call lookup time and kernel-stack setup time. However,
for those system calls that are not intercepted, optimized shortcuts are created
in the interception routines so that as little overhead as possible is generated.

Application Specific Overheads. We also measured the time overhead for
several applications on the client machine where BESIDES is installed. Figure
4 shows the overheads (average values of 10 separate runs) observed on a series
of applications that compile the PostScript version of a draft of this paper
from its .tex source files. The applications executed include delete (cleaning
up the directory), dir (listing the directory content), mpost (building graphic
.eps files), latex #1 (the first run of latex), bibtex (preparing bibliography
items), latex #2 (the second run of latex), latex #3 (the third run of latex),
and dvips (converting the .dvi file to the PostScript file). Both CPU intensive
and I/O intensive applications are present, and this series can be regarded as
representative of applications that require no network access. These
applications are only affected by the native system call interposition overhead
induced by the BESIDES engine. The average time overhead observed in these
experiments is around 8%. The highest increases (around 13%) occur in latex #1
and latex #2, both of which perform significant I/O operations. The lowest
increases (around 1.5% and 3.3%) occur in dir and delete, respectively. These
are shell commands that perform simple tasks that require few system calls. We
note that CPU intensive applications, such as dvips, suffer much smaller
overheads (4.3% for dvips). These results indicate that I/O intensive
applications are likely to incur more overhead than CPU intensive applications.
This conforms to our expectation, as essentially all I/O operations are carried
out by native system calls and are thus subject to interception, while CPU
intensive operations contain a higher percentage of user-mode code that makes
few system calls.

Fig. 5. Time overhead for the command-line web client

Figure 5 shows the overheads observed for a command-line based web client
when it retrieves data files from the local web server. The sizes of the data
files used in this experiment range from 1kB to 5MB. Although the web client
retrieves these data files using HTTP, its network traffic is still subject to
inspection by the SMTP filters in BESIDES. The system call interposition
overhead is relatively small since the web client performs few other system
calls during the whole process. The two largest overheads observed are around
6.3% and 5% (at 1kB and 50kB, respectively). The two smallest overheads are 0%
and 1.8% (at 5MB and 100kB, respectively). The average overhead is around 3.4%,
which is close to 2.02%, the overhead observed at the NtDeviceIoControlFile()
interface. This supports our earlier conjecture that the seemingly high
overhead of the session setup phase and the session termination phase can be
amortized by the low overhead of the session inspection phase.

6 Conclusions 

This paper presents a general paradigm, PAIDS, for intrusion detection and
tolerance by proactive methods. We present our work on behavior skewing and
cordoning, two proactive methods that can be used to create unpredictability
in a system so that unknown malicious executables are more likely to be
detected. This approach differs from existing ones in that such a proactive
system anticipates the attacks from malicious executables and prepares itself
for them in advance by modifying the security policy of a system.

PAIDS has the advantage that it can detect intruders that have not yet been
seen in the wild while maintaining a very low false-positive rate of detection.
BESIDES is a proof-of-concept prototype using the PAIDS approach, and it can be
enhanced in many directions. Obvious enhancements include skewers and cordoners
for additional information domains (e.g., a file access skewer) and system
resources (e.g., a file system cordoner). The BESIDES SSC can be augmented with
more versatile handling and recovery schemes to cope with general malicious
executables. We are also interested in devising more proactive methods. In
general, we want to investigate to what extent we can systematically cordon off
parts or even all of a system by cordoning all the protocols they use to
interact with the external environment.

Finally, we would like to point out that the proactive methods we have studied
are only part of the solution to the general problem of detecting unknown
malicious executables. A system that is equipped with only proactive techniques
is still vulnerable to new types of malicious executables that do not misuse
any of the skewed information domains or that abuse the system in more subtle
ways, such as stealing CPU cycles from legitimate applications. PAIDS is not a
cure-all in that it works only for viruses whose route of spreading infection or
damage-causing mechanism is well characterized. A comprehensive solution that
consists of techniques from different areas is obviously more effective because
the weaknesses of each individual technique can be compensated by the strengths
of the others. We would like to explore how proactive methods can be integrated
with such hybrid solutions.

Abstract. Intrusion Detection Systems (IDSs) are used to monitor computer
systems for signs of security violations. Having detected such signs,
IDSs trigger alerts to report them. These alerts are presented to a human
analyst, who evaluates them and initiates an adequate response.

In practice, IDSs have been observed to trigger thousands of alerts per 
day, most of which are false positives (i.e., alerts mistakenly triggered 
by benign events). This makes it extremely difficult for the analyst to 
correctly identify the true positives (i.e., alerts related to attacks). 

In this paper we describe ALAC, the Adaptive Learner for Alert Classi- 
fication, which is a novel system for reducing false positives in intrusion 
detection. The system supports the human analyst by classifying alerts 
into true positives and false positives. The knowledge of how to classify 
alerts is learned adaptively by observing the analyst. Moreover, ALAC 
can be configured to process autonomously alerts that have been classi- 
fied with high confidence. For example, ALAC may discard alerts that 
were classified with high confidence as false positive. That way, ALAC 
effectively reduces the analyst’s workload. 

We describe a prototype implementation of ALAC and the choice of a 
suitable machine learning technique. Moreover, we experimentally vali- 
date ALAC and show how it facilitates the analyst’s work. 

Keywords: Intrusion detection, false positives, alert classification, ma- 
chine learning 



1 Introduction 

The explosive increase in the number of networked machines and the widespread 
use of the Internet in organizations has led to an increase in the number of unau- 
thorized activities, not only by external attackers but also by internal sources, 
such as fraudulent employees or people abusing their privileges for personal gain. 
As a result, intrusion detection systems (IDSs), as originally introduced by
Anderson [1] and later formalized by Denning [8], have received increasing
attention in recent years.

On the other hand, with the massive deployment of IDSs, their operational
limits and problems have become apparent [2, 3, 15, 23]. False positives, i.e.,
alerts that mistakenly indicate security issues and require attention from the
intrusion detection analyst, are one of the most important problems faced by
intrusion detection today [28]. In fact, it has been estimated that up to 99%
of alerts reported by IDSs are not related to security issues [2, 3, 15].

In this paper we address the problem of false positives in intrusion detection 
by building an alert classifier that tells true from false positives. We define alert 
classification as attaching a label from a fixed set of user-defined labels to an 
alert. In the simplest case, alerts are classified into false and true positives, but 
the classification can be extended to indicate the category of an attack, the 
causes of a false positive or anything else. 

Alerts are classified by a so-called alert classifier (or classifier for short). 
Alert classifiers can be built automatically using machine learning techniques 
or they can be built manually by human experts. The Adaptive Learner for 
Alert Classification (ALAC) introduced in this paper uses the former approach. 
Moreover, ALAC learns alert classifiers whose classification logic is explicit so 
that a human expert can inspect it and verify its correctness. In that way, the 
analyst can gain confidence in ALAC by understanding how it works. 

ALAC classifies alerts into true positives and false positives and presents
these classifications to the intrusion detection analyst, as shown in Fig. 1.
Based on the analyst's feedback, the system generates training examples, which
are used by machine learning techniques to initially build and subsequently
update the classifier. The classifier is then used to classify new
alerts. This process is continuously repeated to improve the alert classification.
At any time the analyst can review the classifier.

Note that this approach hinges on the analyst’s ability to classify alerts cor- 
rectly. This assumption is justified because the analyst must be an expert in in- 
trusion detection to perform incident analysis and initiate appropriate responses. 
This raises the question of why analysts do not write alert classification rules 
themselves or do not write them more frequently. An explanation of these issues 
can be based on the following facts: 

Analysts’ knowledge is implicit: Analysts find it hard to generalize, i.e., to 
formulate more general rules, based on individual alert classifications. For 
example, an analyst might be able to individually classify some alerts as false 
positives, but may not be able to write a general rule that characterizes the 
whole set of these alerts. 

Environments are dynamic: In real-world environments the characteristics 
of alerts change, e.g., different alerts occur as new computers and services are 
installed or as certain worms or attacks gain and lose popularity. The clas- 
sification of alerts may also change. As a result, rules need to be maintained 
and managed. This process is labor-intensive and error-prone. 

As stated above, we use machine learning techniques to build an alert clas- 
sifier that tells true from false positives. Viewed as a machine learning problem, 
alert classification poses several challenges. 

First, the distribution of classes (true positives vs. false positives) is often
skewed, i.e., false positives are more frequent than true positives. Second, the
cost of misclassifying alerts is usually asymmetrical, i.e., misclassifying true
positives as false positives is usually more costly than the other way round.
Third, ALAC classifies alerts in real-time and updates its classifier as new
alerts become available. The learning technique should be efficient enough to
perform in real-time and work incrementally, i.e., be able to modify its logic
as new data becomes available. Fourth, we require the machine learning technique
to use background knowledge, i.e., additional information such as network
topology, alert database, and alert context, which is not contained in alerts
but allows us to build more accurate classifiers (e.g., classifiers using
generalized concepts). In fact, research in machine learning has shown that the
use of background knowledge frequently leads to more natural and concise rules
[16]. However, the use of background knowledge increases the complexity of a
learning task and only some machine learning techniques support it.
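
To make the asymmetric-cost challenge concrete, the following Python sketch
applies a standard expected-cost decision rule to a single alert. It is only an
illustration under stated assumptions: the function name, the probability
input, and the cost values (a 10:1 ratio) are hypothetical and are not taken
from ALAC.

```python
def classify_cost_sensitive(p_true_positive,
                            cost_missed_attack=10.0,
                            cost_false_alarm=1.0):
    """Pick the label with the lower expected misclassification cost.

    p_true_positive: estimated probability that the alert is a real attack.
    cost_missed_attack: assumed cost of labelling a true positive as false.
    cost_false_alarm: assumed cost of labelling a false positive as true.
    """
    # Expected cost if we label the alert "false_positive": we lose only
    # when the alert was in fact a true positive.
    cost_if_labelled_fp = p_true_positive * cost_missed_attack
    # Expected cost if we label the alert "true_positive": we lose only
    # when the alert was in fact a false positive.
    cost_if_labelled_tp = (1.0 - p_true_positive) * cost_false_alarm
    if cost_if_labelled_tp <= cost_if_labelled_fp:
        return "true_positive"
    return "false_positive"


# With symmetric costs, an alert that is 20% likely to be an attack is
# labelled a false positive; with a 10:1 cost ratio it is kept.
print(classify_cost_sensitive(0.2, 1.0, 1.0))
print(classify_cost_sensitive(0.2, 10.0, 1.0))
```

The effect of the asymmetry is to lower the probability at which an alert is
still treated as a true positive.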

We revisit these challenges in Sect. 2.2, where we discuss them and present a 
suitable learning technique. The point made here is that we are facing a highly 
challenging machine learning problem that requires great care to solve properly. 

1.1 Related Work 

To the best of our knowledge, machine learning has not previously been used 
to incrementally build alert classifiers that take background knowledge into ac- 
count. However, some of the concepts we apply here have been successfully used 
in intrusion detection and related domains. 

Building IDSs. In intrusion detection, machine learning has been used primarily 
to build systems that classify network connections (e.g., 1999 KDD CUP [13]) 
or system call sequences (e.g., [22]) into one of several predefined classes. 

This task proved to be very difficult because it aimed at building IDSs only
from training examples. Lee [17] developed a methodology to construct
additional features using data mining. He also showed the importance of
domain-specific knowledge in constructing such IDSs. The key differences of our
work are the real-time use of analyst feedback and the fact that we classify
alerts generated by IDSs, whereas other researchers used machine learning to
build a new IDS.

Fan [10] performed a comprehensive study of cost-sensitive learning using
classifier ensembles with RIPPER; his work is therefore particularly relevant to
ours. The work differs from ours in design goals: we developed a system to
assist human users in classifying alerts generated by an IDS, whereas Fan built
an IDS using machine learning techniques. We also used a simplified cost model
in order to reduce the number of variable parameters in the system. Finally, the
type of learning method used is also different: ensemble-based learning methods
in Fan's work vs. a single classifier in our case.

Alert Classification. The methods used to classify alerts can be divided into two 
categories: first, methods that identify true positives and second, methods that 
identify false positives. 

Methods that identify true positives have been studied particularly well and
can be summarized as follows:

- In environments with multiple IDSs, enhance the confidence of alerts
  generated by more than one IDS (based on the assumption that a real attack
  will be noticed by multiple IDSs, whereas false positives tend to be more
  random) [28].

- Couple sensor alerts with background knowledge to determine whether the
  attacked system is vulnerable [20, 28].

- Create groups of alerts and use heuristics to evaluate whether an alert is a
  false positive [6, 35]. The work by Dain and Cunningham [6] is particularly
  relevant to us as it uses machine learning techniques, namely neural networks
  and decision trees, to build a classifier grouping alerts into so-called
  scenarios. They also discuss domain-specific background knowledge used to
  discover scenarios. In contrast, our work focuses on alert classification and
  uses different background knowledge and different machine learning algorithms.

The second category of alert classification methods identifies false positives.
These methods can be based on data mining, including root cause analysis [15],
or on statistical profiling [23]. For example, Julisch [15] shows that the bulk
of alerts triggered by an IDS can be attributed to a small number of root
causes. He also proposes a data mining technique to discover and understand
these root causes. Knowing the root causes of alerts, one can easily design
filters to remove alerts originating from benign root causes. Our work differs
from the above in that we use real-time machine learning techniques that take
advantage of background knowledge.



1.2 Paper Overview 

The remainder of this paper is organized as follows. In Section 2 we present 
the design of our system and analyze machine learning techniques and their 
limitations with regard to the learning problem we are facing. Section 3 describes 
the prototype implementation of the system and shows results obtained with 
synthetic and real intrusion detection data. In Section 4 we present conclusions 
and future work. 



2 ALAC — An Adaptive Learner for Alert Classification 

In this section we describe the architecture of the system and contrast it to a 
conventional setup. We introduce two modes in which the system can operate, 
namely recommender mode and agent mode. We then focus on machine learning 
techniques and discuss how suitable they are for alert classification. 



2.1 ALAC Architecture 

In a conventional setup, alerts generated by IDSs are passed to a human analyst.
The analyst uses his or her knowledge to distinguish between false and true
positives and to understand the severity of the alerts. Note that conventional
systems may use manual knowledge engineering to build an alert classifier or
may use no alert classifier at all. In any case, the conventional setup does not
take advantage of the fact that the analyst is analyzing the alerts in real-time:
the manual knowledge engineering is separated from analyzing alerts.




Fig. 1. Architecture of ALAC in agent and recommender mode: (a) recommender
mode, (b) agent mode.

As shown in Fig. 1, our system classifies alerts and passes them to the analyst.
It also assigns a classification confidence (or confidence for short) to alerts,
which shows the likelihood of alerts belonging to their assigned classes. The
analyst reviews this classification and reclassifies alerts, if necessary. This
process is recorded and used as training data by the machine learning component
to build an improved alert classifier.

Currently we use a simple human-computer interaction model, where the 
analyst explicitly classifies alerts into true and false positives. More sophisticated 
interaction techniques are possible and will be investigated as part of our future 
work. In addition to the training examples, we use background knowledge to 
learn improved classification rules. These rules are then used by ALAC to classify 
alerts. The analyst can inspect the rules to make sure they are correct. 

The architecture presented above describes the operation of the system in 
recommender mode. The second mode, agent mode, introduces autonomous pro- 
cessing to reduce the operator’s workload. 

In recommender mode (Fig. 1(a)), ALAC classifies alerts and passes all of
them to the console to be verified by the analyst. In other words, the system
assists the analyst by suggesting the correct classification. The advantage for
the analyst is that each alert is already preclassified and that the analyst
only has to verify its correctness. The analyst can prioritize his or her work,
e.g., by dealing with alerts classified as true positives first or by sorting
the alerts by classification confidence. It is important to emphasize that, in
the end, the analyst will review all classifications made by the system.

In agent mode (Fig. 1(b)), ALAC autonomously processes some of the alerts
based on criteria defined by the analyst (i.e., the classification assigned by
ALAC and the classification confidence). By processing alerts we mean that ALAC
executes user-defined actions associated with the class labels and
classification confidence values. For example, alerts classified as false
positives can be automatically removed, thus reducing the analyst's workload. In
contrast, alerts classified as true positives and successful attacks can
initiate an automated response, such as reconfiguring a router or firewall. It
is important to emphasize that such actions should be executed only for alerts
classified with high confidence, whereas the other alerts should still be
reviewed by the analyst.

Note that autonomous alert processing may change the behavior of the sys- 
tem and negatively impact its classification accuracy. To illustrate this with an 
example, suppose the system classifies alerts into true and false positives and it 
is configured to autonomously discard the latter if the classification confidence 
is higher than a given threshold value. Suppose the system learned a good clas- 
sifier and classifies alerts with high confidence. In this case, if the system starts 
classifying all alerts as false positives then these alerts would be autonomously 
discarded and would never be seen by the analyst. These alerts would not become 
training examples and would never be used to improve the classifier. 

Another problem is that alerts classified and processed autonomously cannot
be added to the list of training examples as the analyst has not reviewed them.
If alerts of a certain class are processed autonomously more frequently than
alerts belonging to other classes (as in the above example), we effectively
change the class distribution in the training examples. This has important
implications as machine learning techniques are sensitive to class distribution
in training examples. In the optimal case, the distribution of classes in
training and testing examples should be identical.

To alleviate these problems, we propose a technique called random sampling.
In this technique we randomly select a fraction k of the alerts that would
normally be processed autonomously and instead forward them to the analyst. This
ensures the stability of the system. The value of k is a tradeoff between how
many alerts will be processed autonomously and how much risk of
misclassification is acceptable.
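
Agent mode with random sampling can be sketched as a small routing function.
The Python code below is a minimal illustration under stated assumptions, not
ALAC's implementation: the function name, the label strings, and the example
threshold are hypothetical, and the only autonomous action modelled is
discarding high-confidence false positives.

```python
import random


def route_alert(label, confidence, threshold, k, rng):
    """Decide where an already classified alert goes in agent mode.

    label: "true_positive" or "false_positive" (assigned by the classifier).
    confidence: classification confidence in [0, 1].
    threshold: confidence above which false positives are auto-discarded.
    k: fraction of auto-processable alerts forwarded to the analyst anyway.
    rng: a random.Random instance, passed in so runs are reproducible.
    """
    if label == "false_positive" and confidence > threshold:
        # Random sampling: with probability k, send the alert to the analyst
        # so that it can still become a training example.
        if rng.random() < k:
            return "analyst"
        return "discard"
    # Everything else is reviewed by the analyst as in recommender mode.
    return "analyst"


rng = random.Random(0)
decisions = [route_alert("false_positive", 0.99, 0.9, 0.25, rng)
             for _ in range(10000)]
sampled_fraction = decisions.count("analyst") / len(decisions)
print(sampled_fraction)  # close to k
```

Setting k to 0 disables sampling entirely, while k equal to 1 reduces agent
mode to recommender mode for these alerts.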

Background Knowledge Representation. Recall that we use machine learning 
techniques to build the classifier. In machine learning, if the learner has no 
prior knowledge about the learning problem, it learns exclusively from examples. 
However, difficult learning problems typically require a substantial body of prior 
knowledge [16], which makes it possible to express the learned concept in a more 
natural and concise manner. In the field of machine learning such knowledge is 
referred to as background knowledge, whereas in the field of intrusion detection 
it is quite often called context information (e.g., [32]). 

The use of background knowledge is also very important in intrusion detection
[28]. Examples of background knowledge include:

Network Topology. Network topology contains information about the structure
of the network, assigned IP addresses, etc. It can be used to better
understand the function and role of computers in the network. In the context of
machine learning, network topology can be used to learn rules that make use
of generalized concepts such as Subnet1, Intranet, DMZ, HTTPServer.

Alert Context. Alert context, i.e., other alerts related to a given one, is
crucial to the classification of some alerts (e.g., portscans, password
guessing, repeated exploit attempts). In intrusion detection various
definitions of alert context are used. Typically, the context of an alert has
been defined to include all alerts similar to it; however, the definition of
similarity varies greatly [5, 6, 34].

Alert Semantics and Installed Software. By alert semantics we mean how 
an alert is interpreted by the analyst. For example, the analyst knows what 
type of intrusion the alert refers to (e.g., scan, local attack, remote attack) 
and the type of system affected (e.g., Linux 2.4.20, Internet Explorer 6.0). 
Typically the alert semantics is correlated with the software installed (or the 
device type, e.g., Cisco PIX) to determine whether the system is vulnera- 
ble to the reported attack [20]. The result of this process can be used as 
additional background knowledge used to classify alerts. 

Note that the information about the installed software and alert semantics 
can be used even when alert correlation is not performed, as it allows us 
to learn rules that make use of generalized concepts such as OS Linux, OS 
Windows, etc. 

2.2 Machine Learning Techniques 

Until now we have been focusing on the general system architecture and issues 
specific to intrusion detection. In this section we focus on the machine learning 
component in our system. Based on the discussion in Sect. 1 and the proposed 
system architecture, we can formulate the following requirements for the machine 
learning technique: 

1. Learn from training examples (alert classification given by the analyst). 

2. Build the classifier whose logic can be interpreted by a human analyst, so 
its correctness can be verified. 

3. Be able to incorporate the background knowledge required. 

4. Be efficient enough to perform real-time learning. 

5. Be able to assess the confidence of classifications. Confidence is a numerical 
value attached to the classification, representing how likely it is to be correct. 

6. Support cost-sensitive classification and skewed class distributions. 

7. Learn incrementally. 

Learning an Interpretable Classifier from Examples. The first requirement calls
for supervised machine learning techniques, that is, techniques that can learn
from training examples. The requirement for an understandable classifier further
limits the range of techniques to symbolic learning techniques, that is,
techniques that present the learned concept in a human-readable form (e.g.,
predictive rules, decision trees, Prolog clauses) [22].

Background Knowledge and Efficiency. The ability to incorporate background
knowledge differentiates two big groups of symbolic learners: inductive logic
programming and symbolic attribute-value learners. In general, inductive logic
programming provides the framework for the use of background knowledge,
represented in the form of logic predicates and first-order rules, whereas
attribute-value learners exclusively learn from training examples. Moreover,
training examples for attribute-value learners are limited to a fixed number of
attributes.

The inductive logic programming framework can easily handle the background 
knowledge introduced in Sect. 2.1, including alert context as well as 
arbitrary Prolog clauses. However, its search space is much bigger than that of 
other machine learning techniques, such as rule and decision tree learners, so 
inductive logic programming is much less efficient and can solve only smaller 
problems in reasonable time. This may make such a system unsuitable for 
real-time learning. 

On the other hand, attribute-value learners can use a limited form of back- 
ground knowledge using so-called feature construction (also known as proposi- 
tionalization [16]) by creating additional attributes based on values of existing 
attributes or existing background knowledge. 
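
As a rough illustration of feature construction, the sketch below derives two extra attributes from background knowledge about network topology. The subnet table, the labels and the helper names are assumptions made for this example only; they are not taken from the system described here.

```python
import ipaddress

# Hypothetical background knowledge: known subnets and their labels.
SUBNETS = {
    "intranet": ipaddress.ip_network("10.0.0.0/8"),
    "dmz": ipaddress.ip_network("192.168.1.0/24"),
}

def classify_ip(addr):
    """Map an IP address to a subnet label; unknown addresses count as 'internet'."""
    ip = ipaddress.ip_address(addr)
    for label, net in SUBNETS.items():
        if ip in net:
            return label
    return "internet"

def add_subnet_features(alert):
    """Return a copy of the alert with two constructed attributes added."""
    enriched = dict(alert)
    enriched["src_net"] = classify_ip(alert["src_ip"])
    enriched["dst_net"] = classify_ip(alert["dst_ip"])
    return enriched

alert = {"signature": "ICMP PING NMAP", "src_ip": "198.51.100.7", "dst_ip": "10.1.2.3"}
enriched = add_subnet_features(alert)
print(enriched["src_net"], enriched["dst_net"])
```

An attribute-value learner then sees `src_net` and `dst_net` as ordinary attributes and can form rules over generalized concepts rather than over individual addresses.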

Given that most background knowledge for intrusion detection can be converted 
to additional features using feature construction, and considering the run-time 
requirement, symbolic attribute-value learners seem to be a good choice for 
alert classification. 

Confidence of Classification. Symbolic attribute-value learners include decision 
tree learners (e.g., C4.5 [30]) and rule learners (e.g., AQ [26], C4.5rules [30], 
RIPPER [4]). Both of these techniques can estimate the confidence of a 
classification based on its performance on training examples. However, it has 
been shown that rules are much more comprehensible to humans than decision 
trees [27, 30]. Hence, rule learners are particularly advantageous in our context. 

We analyzed the characteristics of available rule learners, as well as published 
results from applications in intrusion detection and related domains. We have not 
found a good and publicly available rule learner that fulfills all our requirements, 
in particular cost-sensitivity and incremental learning. 

Of the techniques that best fulfill the remaining requirements, we chose RIP- 
PER [4] - a fast and effective rule learner. It has been successfully used in intru- 
sion detection (e.g., on system call sequences and network connection data [17, 
18]) as well as related domains and it has proved to produce concise and intuitive 
rules. As reported by Lee [17], RIPPER rules have two very desirable properties 
for intrusion detection: a good generalization accuracy and concise conditions. 
Another advantage of RIPPER is its efficiency with noisy data sets. 

RIPPER has been well documented in the literature and its description is 
beyond the scope of this paper. However, for the sake of a better understanding 
of the system we will briefly explain what kind of rules RIPPER builds. 

Given a set of training examples labeled with a class label (in our case false 
and true alerts), RIPPER builds a set of rules discriminating between classes. 
Each rule consists of a conjunction of attribute-value comparisons followed by a 
class label; if the rule evaluates to true, that class is predicted. 

RIPPER can produce ordered and unordered rule sets. Very briefly, for a two 
class problem, an unordered rule set contains rules for both classes (for both 
false and true alerts), whereas an ordered rule set contains rules for one class 
only, assuming that all other alerts fall into another class (so-called default rule). 
Both ordered and unordered rule sets have advantages and disadvantages. We 
decided to use ordered rule sets because they are more compact and easier to 
interpret. We will discuss this issue further in Sect. 3.6. 
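
To make the structure of an ordered rule set concrete, here is a minimal sketch of how such rules could be evaluated. It is not RIPPER itself (which learns the rules); the rule shown, the attribute names and the helper functions are hypothetical.

```python
import operator

OPS = {"<=": operator.le, ">=": operator.ge, "=": operator.eq}

def rule_matches(conditions, alert):
    """A rule fires only if every attribute comparison in its conjunction holds."""
    return all(OPS[op](alert[attr], value) for attr, op, value in conditions)

def classify(rules, default_label, alert):
    """Ordered rule set: the first matching rule wins, otherwise the default rule."""
    for conditions, label in rules:
        if rule_matches(conditions, alert):
            return label
    return default_label

# Hypothetical rules describing false alerts only; all other alerts
# fall through to the default rule and are treated as true alerts.
rules = [
    ([("cnt_same_sign", ">=", 1), ("cnt_intrusions", "<=", 0)], "FALSE"),
]

print(classify(rules, "TRUE", {"cnt_same_sign": 3, "cnt_intrusions": 0}))  # FALSE
print(classify(rules, "TRUE", {"cnt_same_sign": 0, "cnt_intrusions": 0}))  # TRUE
```

Because only one class needs explicit rules, the rule set stays short, which is the compactness argument made above.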

Unfortunately, the standard RIPPER algorithm is not cost-sensitive and does 
not support incremental learning. We used the following methods to circumvent 
these limitations. 

Cost-Sensitive Classification and Skewed Class Distribution. Among the various 
methods of making a classification technique cost-sensitive, we focused on those 
that are not specific to a particular machine learning technique: Weighting [33] 
and MetaCost [9]. By changing costs appropriately, these methods can also be 
used to address the problem of skewed class distribution. These methods produce 
comparable results, although this can be data dependent [9, 24]. Experiments not 
documented here showed that in our context Weighting gives better run-time 
performance. Therefore we chose Weighting for our system. 

Weighting resamples the training set so that a standard cost-insensitive learn- 
ing algorithm builds a classifier that optimizes the misclassification cost. The 
input parameter for Weighting is a cost matrix, which defines the costs of mis- 
classifications for individual class pairs. For a binary classification problem, the 
cost matrix has only one degree of freedom - the so-called cost ratio. These 
parameters will be formally defined in Sect. 3. 
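
The following sketch shows one common way to turn misclassification costs into per-class weights and then resample the training set accordingly. It is an assumption-laden illustration, not the exact procedure of [33]: the normalization (weights sum to the number of examples), the cost values and the function names are all chosen for this example.

```python
import random
from collections import Counter

def class_weights(labels, cost):
    """Weight each class by the cost of misclassifying it, scaled so that
    the weights over the whole training set sum to the number of examples."""
    counts = Counter(labels)
    n = len(labels)
    total = sum(cost[c] * counts[c] for c in counts)
    return {c: cost[c] * n / total for c in counts}

def resample(examples, labels, cost, seed=0):
    """Draw a same-sized sample in which each example is picked with
    probability proportional to the weight of its class."""
    w = class_weights(labels, cost)
    rng = random.Random(seed)
    return rng.choices(examples, weights=[w[y] for y in labels], k=len(examples))

# Assumed costs: missing a true alert is 50 times worse than a false alarm.
labels = ["true"] * 20 + ["false"] * 80
cost = {"true": 50.0, "false": 1.0}
w = class_weights(labels, cost)
sample = resample(list(range(100)), labels, cost)
print(w["true"] / w["false"])
```

A cost-insensitive learner trained on `sample` sees the expensive class far more often, which pushes it toward avoiding the costly mistakes.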

Incremental Learning. Ours is an incremental learning task which is best solved 
with an incremental learning technique, but can also be solved with a batch 
learner [12]. As we did not have a working implementation of a purely 
incremental rule learner (e.g., AQ11 [26], AQ11-PM [22]) we decided to use a 
“batch-incremental” approach. 

In this approach we add subsequent training examples to the training set and 
build the classifier using the entire training set as new examples become available. 
It would not be feasible to rebuild the classifier after each new training example, 
therefore we handle training examples in batches. The size of such batches can 
be either constant or dependent on the current performance of the classifier. In 
our case we focused on the second approach. We evaluate the current classifi- 
cation accuracy and, if it drops below a user-defined threshold, we rebuild the 
classifier using the entire training set. Note that the weighted accuracy is more 
suitable than the accuracy measure for cost-sensitive learning. Hence, the pa- 
rameter controlling “batch-incremental” learning is called the threshold weighted 
accuracy. It will be formally defined in Sect. 3. 
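
The batch-incremental loop can be sketched as follows. How the "current" accuracy is measured is not specified above, so the sliding window over recent examples is an assumption, as are all names; the toy learner and the plain accuracy score merely stand in for the rule learner and for weighted accuracy.

```python
from collections import Counter

def batch_incremental(stream, train, score, threshold, window=100):
    """Rebuild the classifier on the entire training set whenever the score
    over the most recent `window` labeled examples drops below `threshold`."""
    training_set, recent, rebuilds = [], [], 0
    model = train(training_set)
    for features, label in stream:
        recent.append((model(features), label))   # prediction vs. analyst label
        recent = recent[-window:]
        training_set.append((features, label))
        if score(recent) < threshold:
            model = train(training_set)           # batch learner, full training set
            rebuilds += 1
            recent = []
    return model, rebuilds

def train_majority(examples):
    """Toy stand-in for a rule learner: always predict the majority label."""
    if not examples:
        return lambda features: "false"
    majority = Counter(label for _, label in examples).most_common(1)[0][0]
    return lambda features: majority

def accuracy(pairs):
    return sum(p == y for p, y in pairs) / len(pairs)

stream = [(None, "false")] * 5 + [(None, "true")] * 10
model, rebuilds = batch_incremental(stream, train_majority, accuracy, 0.6, window=5)
print(rebuilds, model(None))
```

The classifier is left untouched while it performs well, so the cost of learning is paid only when the data drift enough to hurt accuracy.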

The disadvantage of this technique is that the size of the training set grows 
without bound during a system’s lifetime. A future work item of ours will be to 
limit the number of training examples to a certain time window and use a technique 
called partial memory [22] to reduce the number of training examples. 

Summary. To summarize, we have not found a publicly available machine learn- 
ing technique that addresses all our requirements, in particular cost-sensitivity 
and incremental learning. Considering the remaining requirements the most suit- 
able techniques are rule learners. Based on desirable properties and successful ap- 
plications in similar domains, we decided to use RIPPER as our rule-learner. To 
circumvent its limitations with regard to our requirements, we used a technique 
called Weighting to implement cost-sensitivity and adjust for skewed class 
distribution. We also implemented incremental learning as a “batch-incremental” 
approach, whose batch size depends on the current classification accuracy. 

3 Experimental Validation 

We have built a prototype implementation of ALAC in recommender and agent 
mode using the Weka framework [36]. The prototype has been validated with 
synthetic and real intrusion detection data and we summarize the results ob- 
tained in this section. 

Similar to the examples used throughout this paper, our prototype focuses on 
binary classification only, that is on classifying alerts into true and false positives. 
This does not affect the generality of the system, which can be used in multi-class 
classification. However, it simplifies the analysis of a system’s performance. We 
have not evaluated the classification performance in a multi-class classification. 

So far we have referred to alerts related to attacks as true positives and alerts 
mistakenly triggered by benign events as false positives. To avoid confusion with 
the terms used to evaluate our system, we henceforth refer to true positives 
as true alerts and false positives as false alerts, respectively. This allows us to 
use the terms true and false positives for measuring the quality of the alert 
classification. 

More formally, we introduce a confusion matrix C to evaluate the perfor- 
mance of our system. Rows in C represent actual class labels and columns rep- 
resent class labels assigned by the system. Element C[i,j] represents the number 
of instances of class i classified as class j by the system. For a binary classification 
problem, the elements of the matrix are called true positives (tp), false negatives 
(fn), false positives (fp) and true negatives (tn) as shown in Table 1(a). 

For cost-sensitive classification we introduce a cost matrix Co with identical 
meaning of rows and columns. The value of Co[i,j] represents the cost of assign- 
ing a class j to an example belonging to class i. Most often the cost of correct 
classification is zero, i.e., Co[i,i] = 0. In such cases, for binary classifications 
(Table 1(b)), there are only two values in the matrix: Co[2,1] (cost of misclassifying 
a false alert as a real one) and Co[1,2] (cost of misclassifying a true alert as a false 
one). 

Table 1. Confusion and cost matrices for alert classification. The positive class (+) 
denotes true alerts and the negative class (-) denotes false alerts. The columns represent 
classes assigned by the system; the rows represent actual classes. 

(a) Confusion matrix C 

             classified +   classified - 
  actual +   tp             fn 
  actual -   fp             tn 

(b) Cost matrix Co 

             classified +   classified - 
  actual +   0              Co[1,2] 
  actual -   Co[2,1]        0 

In the remainder of the paper we use the following measures defined on cost 
and confusion matrices: true positive rate (TP), false positive rate (FP), false 
negative rate (FN). We also use cost ratio (CR), which represents the ratio of 
the misclassification cost of false positives to false negatives, and its inverse - 
inverse cost ratio (ICR), which we found more intuitive for intrusion detection. 
For cost-sensitive classification we used a commonly used evaluation measure - 
so-called weighted accuracy (WA). Weighted accuracy expresses the accuracy of 
the classification with misclassifications weighted by their misclassification cost. 
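
The measures named above can be computed from the four confusion-matrix counts and the two misclassification costs, as in the sketch below. The rate definitions are the standard ones; the formula used for weighted accuracy (negatives weighted by the cost ratio) is one common form and is an assumption here, as are the function and parameter names and the example counts.

```python
def measures(tp, fn, fp, tn, cost_fn, cost_fp):
    """cost_fn: cost of classifying a true alert as false (false negative);
    cost_fp: cost of classifying a false alert as true (false positive)."""
    cr = cost_fp / cost_fn
    return {
        "TP": tp / (tp + fn),     # true positive rate
        "FP": fp / (fp + tn),     # false positive rate
        "FN": fn / (tp + fn),     # false negative rate
        "CR": cr,                 # cost ratio
        "ICR": 1 / cr,            # inverse cost ratio
        # Assumed form of weighted accuracy: negatives are weighted by CR.
        "WA": (cr * tn + tp) / (cr * (tn + fp) + tp + fn),
    }

# Made-up counts, for illustration only.
m = measures(tp=90, fn=10, fp=50, tn=850, cost_fn=50, cost_fp=1)
print(m)
```

With a small cost ratio, errors on true alerts dominate the weighted accuracy, which matches the intuition that a missed attack is far more expensive than a false alarm.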

3.1 Data Sources 

We used Snort [31] - an open-source network-based IDS - to detect attacks and 
generate alerts. We purposely used the basic out-of-the-box configuration and 
rule set to demonstrate the performance of ALAC in reducing the amount of 
false positives and therefore reducing time-consuming IDS tuning. 

Snort was run on two data sources: a synthetic one known as DARPA 
1999 [19], and a real-world one - namely the traffic observed in a medium- 
sized corporate network (called Data Set B). Alerts are represented as tuples of 
attribute values, with the following seven attributes: signature name, source and 
destination IP addresses, a flag indicating whether an alert is a scan, number of 
scanned hosts and ports and the scan time. 

DARPA 1999 Data Set is a synthetic data set collected from a simulated medium- 
sized computer network in a fictitious military base. The network was connected 
to the outside world by a router. The router was set to open policy, i.e., not 
blocking any connections. The simulation was run for 5 weeks which yielded 
three weeks of training data and two weeks of testing data. Attack truth tables 
describing the attacks that took place exist for both periods. DARPA 1999 data 
consists of: two sets of network traffic (files with tcpdump [14] data) both inside 
and outside the router, BSM and NT audit data, and directory listings. 

In our experiments we ran Snort in batch mode using traffic collected from 
outside the router for both training and testing periods. Note that Snort missed 
some of the attacks in this dataset. Some of them could only be detected using 
host-based intrusion detection, whereas for others Snort simply did not have the 
signature. It is important to note that our goal was not to evaluate the detection 
rate of Snort on this data, but to validate our system in a realistic environment. 

The DARPA 1999 data set has many well-known weaknesses (e.g., [21,25]) 
and we want to make sure that using it we get representative results for how 
ALAC performs in real-world environments. To make this point we analyze how 
the weaknesses identified by McHugh [25], namely the generation of attack and 
background traffic, the amount of training data for anomaly-based systems, 
attack taxonomy and the use of ROC analysis, can affect ALAC. 

With respect to the training and test data, we use both training and test data 
for the incremental learning of ALAC, so that we have sufficient data to train 
the system. With respect to attack taxonomy, we are not using the scoring used 
in the original evaluation, and therefore attack taxonomy is of less significance. 
Finally, we use ROC analysis correctly. 

The problem of the simulation artifacts is more thoroughly analyzed by Mahoney 
and Chan [21]; we therefore use their work to understand how these artifacts 
can affect ALAC. These artifacts manifest themselves in various fields, such as 
the TCP and IP headers and higher protocol data. Snort, as a signature based 
system, does not take advantage of these artifacts and ALAC sees only a small 
subset of them, namely the source IP address. We verified that the rules learned 
by ALAC seldom contain a source IP address and therefore the system does not 
take advantage of simulation artifacts present in source IP addresses. On the 
other hand, we cannot easily estimate how these regularities affect aggregates 
used in the background knowledge. This is still an open issue. 

We think that the proper analysis of these issues is beyond the scope of our 
work, and would also require comparing multiple real-world data sets. The DARPA 
1999 data set is nonetheless valuable for evaluation of our research prototype. 

Data Set B is a real-world data set collected over the period of one month in a 
medium-sized corporate network. The network connects to the Internet through 
firewalls and to the rest of the corporate intranet and does not contain any 
externally accessible machines. Our Snort sensor recorded information exchanged 
between the Internet and the intranet. Owing to privacy issues this data set 
cannot be shared with third parties. We do not claim that it is representative 
for all real-world data sets, but it is an example of a real data set on which our 
system could be used. Hence, we are using this data set as a second validation 
of our system. 

3.2 Alert Labeling 

Our system assumes that alerts are labeled by the analyst. In this section we 
explain how we labeled alerts used to evaluate the system (the statistics for both 
datasets are shown in Table 2). 

DARPA 1999 Data Set. In a first step we generated alerts using Snort running 
in batch mode and writing alerts into a relational database. In the second step 
we used automatic labeling of IDS alerts using the provided attack truth tables. 

Table 2. Statistics generated by the Snort sensor with DARPA 1999 data set and 
Data Set B. 

                          DARPA 1999   Data Set B 
Duration of experiment:   5 weeks      1 month 
Number of IDS alerts:     59812        47099 
False alerts:             48103        33220 
True alerts:              11709        13383 
Unidentified:             -            496 



For labeling, we used an automatic approach which can be easily reproduced 
by researchers in other environments, even with different IDS sensors. We 
consider all alerts meeting the following criteria to be related to an attack: 
(i) matching source IP address, (ii) matching destination IP address and (iii) alert 
time stamp in the time window in which the attack has occurred. We marked all 
remaining alerts as false alerts. While manually reviewing the alerts we found 
that, in many cases, the classification is ambiguous (e.g., a benign PING alert 
can equally be classified as malicious if it is sent to the host being attacked). 
This may introduce an error in class labels. 

Note that different attacks triggered a different number of alerts (e.g., wide 
network scans triggered thousands of alerts). For the evaluation of our system 
we discarded the information regarding which alerts belong to which attack and 
labeled all these alerts as true alerts. 
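
The three labeling criteria translate almost directly into code. The sketch below is a minimal illustration; the record layout of the truth table, the field names and the sample values are assumptions, not the actual format of the DARPA 1999 truth tables.

```python
from dataclasses import dataclass

@dataclass
class Attack:
    """One entry of a hypothetical attack truth table."""
    src_ip: str
    dst_ip: str
    start: float   # start of the attack window (seconds)
    end: float     # end of the attack window (seconds)

def is_true_alert(alert, attacks):
    """An alert is labeled as a true alert if it matches some attack on
    (i) source IP, (ii) destination IP and (iii) the attack time window."""
    return any(
        alert["src_ip"] == a.src_ip
        and alert["dst_ip"] == a.dst_ip
        and a.start <= alert["time"] <= a.end
        for a in attacks
    )

attacks = [Attack("203.0.113.5", "172.16.112.50", 1000.0, 1600.0)]
inside = {"src_ip": "203.0.113.5", "dst_ip": "172.16.112.50", "time": 1200.0}
too_late = {"src_ip": "203.0.113.5", "dst_ip": "172.16.112.50", "time": 2000.0}
print(is_true_alert(inside, attacks), is_true_alert(too_late, attacks))
```

Note that such a rule labels every alert that falls inside an attack window between the two hosts, including benign ones, which is the source of the label ambiguity mentioned above.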

Data Set B. We generated these alerts in real-time using Snort. As opposed to 
the first data set we did not have information concerning attacks. The alerts 
have been classified based on the author’s expertise in intrusion detection into 
groups indicating possible type and cause of the alert. There was also a certain 
number of alerts that could not be classified into true or false positives. Similarly 
to the first data set we used only binary classification to evaluate the system, 
and labeled the unidentified alerts as true positives. 

Note that this data set was collected in a well maintained and well protected 
network with no direct Internet access. We observed a low number of attacks in 
this network, but many alerts were generated. We observed that large groups of 
alerts can be explained by events such as a single worm infection and unautho- 
rized network scans. The problem of removing such redundancy can be solved 
by so-called alert correlation systems [5,7,34], where a group of alerts can be 
replaced by a meta-alert representative of the alerts in the group, prior to clas- 
sification. The topic of alert correlation is beyond the scope of this paper and 
will be addressed as a future work item. 

Another issue is that the classification of alerts was done by only one analyst 
and therefore may contain errors. This raises the question of how such classifi- 
cation errors affect the performance of ALAC. To address this issue, one can ask 
multiple analysts to classify the dataset independently. Then the results can be 
compared using interrater reliability analysis. 

3.3 Background Knowledge 

We decided to focus on the first two types of background knowledge presented in 
Sect. 2.1, namely network topology and alert context. Owing to a lack of required 
information concerning installed software, we decided not to implement matching 
alert semantics with installed software. This would also be a repetition of the 
experiments by Lippmann et al. [20]. 

As discussed in Sect. 3.1, we used an attribute-value representation of alarms 
with the background knowledge represented as additional attributes. Specifically, 
the background knowledge resulted in 19 attributes, which are calculated as 
follows: 

Classification of IP addresses resulted in an additional attribute for both 
source and destination IP classifying machines according to their known 
subnets (e.g., Internet, intranet, DMZ). 

Classification of hosts resulted in additional attributes indicating the oper- 
ating system and the host type for known IP addresses. 

Aggregates1 resulted in additional attributes with the number of alerts in 
the following categories (we calculated these aggregates for alerts in a time 
window of 1 minute): 

— alerts with the same source IP address, 

— alerts with the same destination IP address, 

— alerts with the same source or destination IP address, 

— alerts with the same signature, 

— alerts classified as intrusions. 

Aggregates2,3 were calculated similarly to the first set of attributes, but in 
time windows of 5 and 30 minutes, respectively. 
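
A possible way to compute the fifteen aggregate attributes (five categories in three time windows) is sketched below. The field names, the attribute naming scheme, the reading of "same source or destination IP address" and the choice to count only earlier alerts, excluding the current one, are all assumptions made for illustration.

```python
WINDOWS = {"w1": 60, "w2": 300, "w3": 1800}   # 1, 5 and 30 minutes, in seconds

# One predicate per aggregate category; a is an earlier alert, c the current one.
CATEGORIES = {
    "srcIP": lambda a, c: a["src_ip"] == c["src_ip"],
    "dstIP": lambda a, c: a["dst_ip"] == c["dst_ip"],
    "ip": lambda a, c: a["src_ip"] == c["src_ip"] or a["dst_ip"] == c["dst_ip"],
    "sign": lambda a, c: a["sign"] == c["sign"],
    "intr": lambda a, c: a["intrusion"],
}

def aggregates(history, current):
    """Count earlier alerts per category and time window."""
    feats = {}
    for wname, secs in WINDOWS.items():
        recent = [a for a in history if 0 <= current["time"] - a["time"] <= secs]
        for cname, same in CATEGORIES.items():
            feats[f"cnt_{cname}_{wname}"] = sum(1 for a in recent if same(a, current))
    return feats

history = [
    {"time": 0, "src_ip": "a", "dst_ip": "x", "sign": "S1", "intrusion": False},
    {"time": 250, "src_ip": "a", "dst_ip": "y", "sign": "S2", "intrusion": True},
    {"time": 290, "src_ip": "b", "dst_ip": "x", "sign": "S1", "intrusion": False},
]
current = {"time": 300, "src_ip": "a", "dst_ip": "x", "sign": "S1", "intrusion": False}
f = aggregates(history, current)
print(f["cnt_srcIP_w1"], f["cnt_ip_w1"], f["cnt_srcIP_w2"])
```

The linear scan over the history is fine for a sketch; a production system would more likely keep per-window counters or an index over time stamps.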

This choice of background knowledge, which was motivated by heuristics 
used in alert correlation systems, is necessarily a bit ad-hoc and reflects the 
author’s expertise in classifying IDS attacks. As this background knowledge is 
not especially tailored to training data, it is natural to ask how useful it is 
for alert classification. We discuss the answer to this question in the following 
sections. 



3.4 Results Obtained with DARPA 1999 Data Set 

Our experiments were conducted in two stages. In the first stage we evaluated the 
performance of the classifier and the influence of adding background knowledge 
to alerts on the accuracy of classification. The results presented here allowed us to 
set some parameters in ALAC. In the second stage we evaluated the performance 
of ALAC in recommender and agent mode. 

Background Knowledge and Setting ALAC Parameters. Here we describe the 
results of experiments conducted to evaluate background knowledge and to set 
ALAC parameters. Note that in the experiments we used only the machine 
learning component of ALAC, namely a RIPPER module, to build classifiers for 
the entire data set. Hereafter we refer to these results as batch classification. 

Fig. 2. ROC curves for classifier used with different types of background knowledge. 

Since the behavior of classifiers depends on the assigned costs, we used ROC 
(Receiver Operating Characteristic) analysis [29] to evaluate the performance of 
our classifier for different misclassification costs. Figure 2(a) shows the perfor- 
mance of the classifier using data with different amounts of background knowl- 
edge. Each curve was plotted by varying the cost ratio for the classifier. Each 
point in the curve represents results obtained from 10-fold cross validation for a 
given misclassification cost and type of background knowledge. 

As we expected, the classifier with no background knowledge (plus series) 
performs worse than the classifier with simple classifications of IP addresses 
and operating systems running on the machines (cross series) in terms of false 
positives. Using the background knowledge consisting of the classifications above 
and aggregates introduced in Sect. 3.3 significantly reduces the false positive rate 
and increases the true positive rate (star series). Full background knowledge 
(having additional aggregates in multiple time windows) performs comparably 
to the reduced one (star vs. box series). In our experiments with ALAC we 
decided to use full background knowledge. 

ROC curves show the performance of the system under different misclassifi- 
cation costs, but they do not show how the curve was built. Recall from Sect. 2.2 
that we use the inverse cost ratio in Weighting to make RIPPER cost-sensitive 
and varied this parameter to obtain multiple points on the curve. We used this 
curve to select good parameters of our model. 

ALAC is controlled by a number of parameters, which we had to set in 
order to evaluate its performance. To evaluate the performance of ALAC as an 
incremental classifier we first selected the parameters of its base classifier. 

The performance of the base classifier at various costs and class distributions 
is depicted by the ROC curve and it is possible to select an optimal classifier for 
a certain cost and class distribution [11]. As these values are not defined for our 
task, we could not select an optimal classifier using the above method. Therefore 
we arbitrarily selected a base classifier that gives a good tradeoff between false 
positives and false negatives, for ICR = 50. 

The second parameter is the threshold weighted accuracy (WA) for rebuilding 
the classifier (see Sect. 2.2). The value of threshold weighted accuracy should 
be chosen carefully as it represents a tradeoff between classification accuracy and 
how frequently the machine learning algorithm is run. We chose the value equal 
to the accuracy of a classifier in batch mode. Experiments not documented here 
showed that using higher values increases the learning frequency with no signif- 
icant improvement in classification accuracy. 

We assumed that in real-life scenarios the system would work with an initial 
model and only use new training examples to modify its model. To simulate this 
we used 30% of input data to build the initial classifier and the remaining 70% 
to evaluate the system. 

ALAC in Recommender Mode. In recommender mode the analyst reviews each 
alert and corrects ALAC misclassifications. We plotted the number of misclassi- 
fications: false positives (Fig. 3(a)) and false negatives (Fig. 3(b)) as a function 
of processed alerts. 

The resulting overall false negative rate (FN = 0.024) is much higher than 
the false negative rate for the batch classification on the entire data set (FN = 
0.0076) as shown in Fig. 2(a). At the same time, the overall false positive rate 
(FP = 0.025) is less than half of the false positive rate for batch classification 
(FP = 0.06). These differences are expected due to different learning and 
evaluation methods used, i.e., batch-incremental learning vs. 10-fold cross validation. 
Note that both ALAC and a batch classifier have a very good classification 
accuracy and yield comparable results in terms of accuracy. 





Fig. 3. False negatives and false positives for ALAC in agent and recommender modes 
(DARPA1999 data set, ICR=50). 



ALAC in Agent Mode. In agent mode ALAC processes alerts autonomously 
based on criteria defined by the analyst, described in Sect. 2.1. We configured 
the system to forward all alerts classified as true alerts and false alerts classified 
with low confidence (confidence < cth) to the analyst. The system discarded all 
other alerts, i.e., false alerts classified with high confidence, except for a fraction 
k of randomly chosen alerts, which were also forwarded to the analyst. 
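
The forwarding policy just described fits in a few lines. This is a sketch only: the function name, the label strings and the default values of the confidence threshold and sampling rate are placeholders, and the random source is injectable so that the behavior can be checked deterministically.

```python
import random

def agent_decision(label, confidence, c_th=0.9, k=0.25, rng=random):
    """Decide whether a classified alert is forwarded to the analyst or discarded."""
    if label == "true":
        return "forward"      # every alert classified as a true alert
    if confidence < c_th:
        return "forward"      # false alert classified with low confidence
    if rng.random() < k:
        return "forward"      # random sample of confident false alerts
    return "discard"          # confident false alert, not sampled

print(agent_decision("true", 0.99))
print(agent_decision("false", 0.50))
print(agent_decision("false", 0.95, k=0.0))
```

The random sample keeps a trickle of discarded alerts flowing to the analyst, so that mistakes among confidently classified false alerts can still be noticed and fed back as training examples.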

Similarly to the recommender mode, we calculated the number of 
misclassifications made by the system. We experimented with different values of cth and 
sampling rates k. We then chose cth = 90% and three sampling rates k: 0.1, 
0.25 and 0.5. Our experiments show that sampling rates below 0.1 make the 
agent misclassify too many alerts and significantly change the class distribution 
in the training examples. On the other hand, with sampling rates much higher 
than 0.5, the system works similarly to recommender mode and is less useful for 
the analyst. 

Notice that there are two types of false negatives in agent mode - the ones 
corrected by the analyst and the ones the analyst is not aware of because the 
alerts have been discarded. We plotted the second type of misclassification as an 
error bar in Fig. 3(a). Intuitively with lower sampling rates, the agent will have 
fewer false negatives of the first type, in fact missing more alerts. As expected 
the total number of false negatives is lower with higher sampling rates. 

We were surprised to observe that the recommender and the agent have sim- 
ilar false positive rates (FP = 0.025 for both cases) and similar false negative 
rates, even with low sampling rates (FN = 0.026 for k = 0.25 vs. FN = 0.025). 
This seemingly counterintuitive result can be explained if we note that auto- 
matic processing of alerts classified as false positives effectively changes the class 
distribution in training examples in favor of true alerts. As a result the agent 
performs comparably to the recommender. 




(a) DARPA1999 data set, ICR=50 

Fig. 4. Number of alerts processed autonomously by ALAC in agent mode. 



As shown in Fig. 4(a), with the sampling rate of 0.25, more than 45% of false 
alerts were processed and discarded by ALAC. At the same time the number 
of unnoticed false negatives is half the number of mistakes for recommender 
mode. Our experiments show that the system is useful for intrusion detection 
analysts as it significantly reduces the number of false positives, without making 
many mistakes. 



3.5 Results Obtained with Data Set B 

We used the second dataset as an independent validation of the system. To 
avoid “fitting the model to the data” we used the same set of parameters as for 
the first data set. However, the ROC curve in Fig. 2(b) shows that the classifier 
achieves a much higher true positive rate and a much lower false negative rate than 
for the first data set, which means that Data Set B is easier to classify. The likely 
explanation of this fact is that Data Set B contains fewer intrusions and more 
redundancy than the first data set. 

Notice that the ROC curve consists of two distinct parts. An analysis shows 
that the left part corresponds to RIPPER run for small ICRs, where it learns 
the rules describing true alerts. The right part of the curve corresponds to high 
ICRs, where RIPPER learns the rules describing false alerts. Better performance 
in the first case can be explained by the fact that the intrusions in this data set 
are more structured and therefore easier to learn. On the other hand, false alerts 
are more difficult to describe and hence the performance is poorer. 



Background Knowledge and Setting ALAC Parameters. Results with ROC anal- 
ysis (Fig. 2(b)) show that the classifier correctly classifies most of the exam- 
ples, and adding background knowledge has little effect on classification. To 
have the same conditions as with the first data set, we nonetheless decided 
to use the full background knowledge. We also noticed that ICR = 50 is not 
the optimal value for this dataset as it results in a high false positive rate 
(FN = 0.002, FP = 0.05). 

We observed that ALAC, when run with 30% of the alerts as an initial 
classifier, classified the remaining alerts with very few learning runs. Therefore, 
to demonstrate its incremental learning capabilities, we decided to lower the 
initial amount of training data from 30% to 5% of all the alerts. 



ALAC in Recommender Mode. Figure 5 shows that in recommender mode the 
system has a much lower overall false negative rate (FN = 0.0045) and a higher 
overall false positive rate (FP = 0.10) than for the DARPA 1999 data set, which is 
comparable to the results of the classification in batch mode. We also observed 
that the learning only took place for approximately the first 30% of the entire 
data set and the classifier classified the remaining alerts with no additional 
learning. This phenomenon can also be explained by the fact that Data Set B 
contains more regularities and the classifier is easier to build. 

This is different in the case of the DARPA1999 data set, where the classifier 
was frequently rebuilt in the last 30% of the data. For DARPA1999 data set the 
behavior of ALAC is explained by the fact that most of the intrusions actually 
took place in the last two weeks of the experiment. 

ALAC in Agent Mode. In agent mode we obtained results similar to those in 
recommender mode, with a great number of alerts being processed autonomously 
by the system (FN = 0.0065, FP = 0.13). As shown in Fig. 4(b), with the 
sampling rate of 0.25, more than 27% of all alerts were processed by the agent. 
At the same time the actual number of unnoticed false negatives is one third 
smaller than the number of false negatives in recommender mode. This confirms 
the usefulness of the system tested with an independent data set. 

Similarly to the observation in Sect. 3.4, with lower sampling rates the agent will 
have seemingly fewer false negatives, in fact missing more alerts. As expected the 
total number of false negatives is lower with higher sampling rates. This effect 
is not as clearly visible as with the DARPA1999 data set. 




[Figure 5: two plots, each with x-axis "Alerts Processed"]

Fig. 5. False negatives and false positives for ALAC in agent and recommender modes
(Data set B, ICR=50).



3.6 Understanding the Rules 

One requirement of our system was that the rules can be reviewed by the analyst 
and their correctness can be verified. The rules built by RIPPER are generally 
human interpretable and thus can be reviewed by the analyst. Here is a repre- 
sentative example of two rules used by ALAC: 

(cnt_intr_w1 <= 0) and (cnt_sign_w3 >= 1) and (cnt_sign_w1 >= 1)
and (cnt_dstIP_w1 >= 1) => class=FALSE
(cnt_srcIP_w3 <= 6) and (cnt_intr_w2 <= 0) and (cnt_ip_w2 >= 2)
and (sign = ICMP PING NMAP) => class=FALSE

The first rule reads as follows: If the number of alerts classified as
intrusions in the last minute (window w1) equals zero and there have been other
alerts triggered by a given signature and targeted at the same IP address as the
current alert, then the alert should be classified as a false positive. The
second rule says that, if the number of NMAP PING alerts originating from the
same IP address is at most six in the last 30 minutes (window w3), there have
been no intrusions in the last 5 minutes (window w2) and there has been at least
1 alert with identical source or destination IP address, then the current alert
is a false positive.

These rules are intuitively appealing: If there have been similar alerts recently
and they were all false alerts, then the current alert is also a false alert. The
second rule says that if the number of NMAP PING alerts is small and there have
not been any intrusions recently, then the alert is a false alert.
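
To make the reading of the two rules concrete, the following minimal sketch evaluates them on a single alert. It is an illustration only: the dictionary representation of an alert, the function name classify_alert, and the fallback label "TRUE" when neither rule fires are assumptions, and cnt_intr_w2 is assumed to be the intrusion counter for window w2 in the second rule.

```python
# Illustrative sketch only. The alert is assumed to be a dict of the
# aggregate features named in the two rules; "TRUE" as the label when no
# rule fires is an assumption, not taken from the rule set itself.

def classify_alert(alert):
    """Return "FALSE" (false positive) if either rule fires, else "TRUE"."""
    # Rule 1: no intrusions in window w1, while the same signature has
    # fired in w3 and w1 and the same destination IP was seen in w1.
    rule1 = (
        alert["cnt_intr_w1"] <= 0
        and alert["cnt_sign_w3"] >= 1
        and alert["cnt_sign_w1"] >= 1
        and alert["cnt_dstIP_w1"] >= 1
    )
    # Rule 2: few alerts from this source in w3, no intrusions in w2,
    # related IP activity in w2, and the signature is the NMAP ping.
    rule2 = (
        alert["cnt_srcIP_w3"] <= 6
        and alert["cnt_intr_w2"] <= 0
        and alert["cnt_ip_w2"] >= 2
        and alert["sign"] == "ICMP PING NMAP"
    )
    return "FALSE" if (rule1 or rule2) else "TRUE"


example = {
    "cnt_intr_w1": 0, "cnt_sign_w3": 2, "cnt_sign_w1": 1, "cnt_dstIP_w1": 1,
    "cnt_srcIP_w3": 10, "cnt_intr_w2": 1, "cnt_ip_w2": 0, "sign": "OTHER",
}
label = classify_alert(example)  # rule 1 fires for this alert
```

Because each rule is a plain conjunction of threshold tests, an analyst can check it against a concrete alert by hand, which is the property the section relies on.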

We observed that the comprehensibility of rules depends on several factors,
including the background knowledge and the cost ratio. With less background
knowledge RIPPER learns rules that are more specific and more difficult to
understand. The effect of varying the cost ratio is particularly apparent for
rules produced while constructing the ROC curve, where RIPPER induces rules for
either true or false alerts. This is due to the use of RIPPER running in ordered
rule set mode.

4 Conclusions and Future Work 

We presented a novel concept of building an adaptive alert classifier based on 
an intrusion detection analyst’s feedback using machine learning techniques. We 
discussed the issues of human feedback and background knowledge, and reviewed 
machine learning techniques suitable for alert classification. Finally, we presented 
a prototype implementation and evaluated its performance on synthetic as well 
as real intrusion data. 

We showed that background knowledge is useful for alert classification. The 
results were particularly clear for the DARPA 1999 data set. For the real-world 
dataset, adding background knowledge had little impact on the classification 
accuracy. The second set was much easier to classify, even with no background 
knowledge. Hence, we did not expect improvement from background knowledge 
in this case. We also showed that the system is useful in recommender mode, 
where it adaptively learns the classification from the analyst. For both datasets 
we obtained false negative and false positive rates comparable to batch classifi- 
cation. Note that in recommender mode all system misclassifications would have 
been corrected by the analyst. 

In addition, we found that our system is useful in agent mode, where some 
alerts are autonomously processed (e.g., false positives classified with high con- 
fidence are discarded). More importantly, for both data sets the false negative 
rate of our system is comparable to that in the recommender mode. At the same 
time, the number of false positives has been reduced by approximately 30%. 

The system has a few numeric parameters that influence its performance 
and should be adjusted depending on the input data. In the future, we intend to 
investigate how the value of these parameters can be automatically determined. 
We are also aware of the limitations of the data sets used. We aim to evaluate 
the performance of the system on the basis of more realistic intrusion detection 
data and to integrate an alert correlation system to reduce redundancy in alerts. 
Our system uses RIPPER, a noise-tolerant algorithm, but the extent to which
ALAC can tolerate errors in the data is currently unknown. We will address this
issue by introducing artificial errors and observing how they affect the system.

The topic of learning comprehensible rules is very interesting and we plan 
to investigate it further. We are currently looking at learning multiple classifiers 
for each signature and using RIPPER in unordered rule set mode. 

In the machine learning part we intend to focus on the development of an
incremental machine learning technique suitable for learning a classifier for
intrusion detection. Initially we want to perform experiments with
partial-memory techniques and batch classifiers. Later we will focus on truly
incremental techniques. It is important that such techniques be able to
incorporate the required background knowledge.


Abstract. Attack analysis is a challenging problem, especially in emerging
environments where there are few known attack cases. One such new
environment is the Mobile Ad hoc Network (MANET). In this paper,
we present a systematic approach to analyze attacks. We introduce the
concept of basic events. An attack can be decomposed into certain
combinations of basic events. We then define a taxonomy of anomalous basic
events by analyzing the basic security goals.

Attack analysis provides a basis for designing detection models. We use
both specification-based and statistical-based approaches. First, normal
basic events of the protocol can be modeled by an extended finite state
automaton (EFSA) according to the protocol specifications. The EFSA
can detect anomalous basic events that are direct violations of the
specifications. Statistical learning algorithms, with statistical features,
i.e., statistics on the states and transitions of the EFSA, can train an
effective detection model to detect those anomalous basic events that are
temporal and statistical in nature.

We use the AODV routing protocol as a case study to validate our 
research. Our experiments on the MobiEmu wireless emulation plat- 
form show that our specification-based and statistical-based models cover 
most of the anomalous basic events in our taxonomy. 

Keywords: MANET, Attack Analysis, Intrusion Detection, Routing Se- 
curity, AODV 



1 Introduction 

Network protocol design and implementation have become increasingly complex. 
Consequently, securing network protocols requires detailed analysis of normal 
protocol operations and vulnerabilities. The process is tedious and error-prone. 
Traditional attack analysis categorizes attacks based on knowledge of known in- 
cidents. Therefore, such analysis cannot be applied to new (unknown) attacks. 
The problem is even more serious in new environments where there are very few 
known attacks. Mobile ad hoc networking (MANET) is such an example. An 
ad hoc network consists of a group of autonomous mobile nodes with no infras- 
tructure support. Recently, many MANET applications have emerged, such as 
battlefield operations and personal digital assistant (PDA) communication, among
others. MANET and its applications are very different from traditional networks
and their applications. They are also more vulnerable due to their unique
characteristics, such as an open physical medium, dynamic topology, a
decentralized computing environment, and the lack of a clear line of defense.
Recent research efforts, such as [Zap01,HPJ02], attempt to apply cryptographic
techniques to secure MANET routing protocols. However, existing experience in
wired security has already taught us the necessity of defense-in-depth because
there are always human errors and design flaws that enable attackers to exploit
software vulnerabilities. Therefore, it is also necessary to develop detection
and response techniques for MANET.

Designing an effective intrusion detection system (IDS), as well as other se- 
curity mechanisms, requires a deep understanding of threat models and adver- 
saries’ attack capabilities. We note that since MANET uses a TCP/IP stack, 
many well-known attacks can be applied to MANET but existing security mea- 
sures in wired networks can address these attacks. On the other hand, some 
protocols, especially routing protocols, are MANET specific. Very few attack 
instances of these protocols have been well studied. It follows that traditional 
attack analysis cannot work effectively. In this paper, we propose a new attack 
analysis approach by decomposing a complicated attack into a number of basic 
components called basic events. Every basic event consists of causally related
protocol behavior and uses resources solely within a single node. It is easier to 
study the protocol behavior more accurately from the point of view of a single 
node. Specifically, we study the basic routing behavior in MANET. We propose 
a taxonomy of anomalous basic events for MANET, which is based on poten- 
tial targets that attackers can compromise and the security goals that attackers 
attempt to compromise for each target. 

Based on the taxonomy, we build a prototype IDS for MANET routing pro- 
tocols. We choose one of the most popular MANET routing protocols, AODV, 
as a case study. We develop specifications in the form of an extended finite state 
automaton (EFSA) from the AODV IETF Draft [PBRD03]. We apply two detection
approaches which use the EFSA in different ways. First, we can detect violations
of the specification directly, which is often referred to as a specification-based
approach. Second, we can also detect statistical anomalies by constructing
statistical features from the specification and applying machine learning
methods. This statistical-based approach is more suitable for attacks that are
temporal and statistical in nature.

In short, our main contribution is the concept of basic events and its use in 
attack taxonomy analysis. We also show how to use protocol specifications to 
model normal basic events and derive features from the specification to design 
an intrusion detection system. 

We use MobiEmu [ZL02] as our evaluation platform for related experiments.
MobiEmu is an experimental testbed that emulates MANET in a wired network.
Our experiments show that our approach requires a much smaller set of features
to capture the same set of attacks, compared with our previous work on
developing an IDS for MANET [HFLY02], which attempted an exhaustive search of
features without the help of a taxonomy and protocol specification. As the
feature set is smaller and derived directly from the protocol specification, it
has the additional advantage that domain experts can review it. This further
improves accuracy.

The rest of the paper is organized as follows. Section 2 discusses related 
concepts of basic events and presents a taxonomy of anomalous basic events in 
MANET. Section 3 presents an AODV EFSA specification. Section 4 describes 
the design of a MANET IDS, experiments and results. Finally, related work and 
conclusions are discussed in Sections 5 and 6. 

2 Taxonomy of Anomalous Basic Events 

2.1 Concepts 

Anomalies or attacks can be categorized using different criteria. Since there is no 
well-established taxonomy yet in MANET, we describe a systematic approach 
to study MANET attacks based on the concept of anomalous basic events. 
We use MANET routing as the subject of our study. 

A routing process in MANET involves causally related, cooperative
operations from a number of nodes. For example, the Route Discovery process,
which frequently appears in on-demand routing protocols [JMB01,PBRD03], consists
of chained actions from the source node to the destination node (or an
intermediate node that knows a route to the destination) and back to the source
node. Such a process can be decomposed into a series of basic routing events. A
basic routing event is defined as an indivisible local segment of a routing
process. More precisely, it is the smallest set of causally related routing
operations on a single node. We will use the term basic event for short. The
Route Discovery process can therefore be decomposed into the following basic
events: 1) The source node delivers an initial Route Request; 2) Each node
(except for the source node and the node that has a route to the destination) in
the forward path receives a Route Request from the previous node and forwards
it; 3) The replying node receives the Route Request and replies with a Route
Reply message; 4) An intermediate node in the reverse path receives a Route
Reply message and forwards it; 5) Finally, the source node receives the Route
Reply message and establishes a route to the destination.

Note that a basic (routing) event may contain one or more operations, such
as receiving a packet, modifying a routing parameter, or delivering a packet.
However, the integrity of routing logic requires every basic event to be
conducted in a transaction fashion. That is, it is considered successful (or
normal) if and only if it performs all of its operations in the specified order.
We assume that a system specification exists that specifies normal protocol
behavior. As we will show later in the paper, a system specification can be
represented in the form of an extended finite state machine; a (normal) basic
event maps to a single transition in a given extended finite state machine. We
further note that to define a basic event, operations are restricted to the
scope of a single MANET node because only a local data source can be fully
trusted by the intrusion detection agent on the same node.
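
The decomposition and the transaction requirement can be illustrated with a small sketch. The class name BasicEvent, the operation strings, the node labels S, A and D, and the reading of "all operations in the specified order" as exact sequence equality are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicEvent:
    """Causally related operations on one node (illustrative model)."""
    node: str            # the single node on which the event takes place
    operations: tuple    # operations in the order the specification requires


def is_normal(observed_ops, event):
    """Normal iff all specified operations were performed, in order."""
    return tuple(observed_ops) == event.operations


# Route Discovery from source S to destination D via intermediate node A,
# decomposed into the five basic events listed in the text.
route_discovery = [
    BasicEvent("S", ("send RREQ",)),
    BasicEvent("A", ("recv RREQ", "forward RREQ")),
    BasicEvent("D", ("recv RREQ", "send RREP")),
    BasicEvent("A", ("recv RREP", "forward RREP")),
    BasicEvent("S", ("recv RREP", "establish route")),
]

# Node A receives the request but never forwards it: the event is
# incomplete and therefore not normal.
dropped = is_normal(["recv RREQ"], route_discovery[1])
```

Each event only refers to operations on its own node, so the check can be made by a detection agent that sees nothing but local data.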

On the other hand, an anomalous basic event is a basic event that does 
not follow the system specification. Obviously, it is useful to study anomalous 
basic events in order to capture the characteristics of basic attack components. 
Nevertheless, we note that it is possible that some attacks do not trigger any 
anomalous basic events. For example, an attack may involve elements from a 
different layer that the system specification does not describe, or it may involve 
knowledge beyond a single node. A Wormhole attack [HPJ01] is an example
of the first case, where two wireless nodes can create a hidden tunnel through 
wires or wireless links with stronger transmission power. A network scan on 
known (vulnerable) ports is an example of the latter case because each single 
node observes only legitimate uses. To deal with these issues, we plan to work 
on a multiple layer and global intrusion detection system. 

2.2 Taxonomy of Anomalous Basic Events 
in MANET Routing Protocols 

We identify an anomalous basic event by two components, its target and
operation. A protocol agent running on a single node has different elements to
operate on, with different semantics. The routing behavior of MANET typically
involves three elements or targets: routing messages, data packets and routing
table (or routing cache) entries. Furthermore, we need to study the possible
attack operations on these targets. Individual security requirements can be
identified by examining the following well-known security goals:
Confidentiality, Integrity and Availability. We summarize possible combinations
of routing targets and operations in Table 1. In this table, we list three basic
operations for Integrity compromise: add, delete and change. The exact meanings
of these operations need to be interpreted properly in the context of individual
targets.
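
The target/operation view lends itself to a simple lookup structure. The sketch below encodes the entries of Table 1 as a dictionary; the key strings, the function name, and the separate "integrity:rushing" key (Rushing appears in Table 1 for routing messages only) are assumptions made for illustration.

```python
# Keys are (compromise, target); values are the event names from Table 1.
# The key spellings are hypothetical and chosen only for this sketch.
TAXONOMY = {
    ("confidentiality", "routing messages"): "Location Disclosure",
    ("confidentiality", "data packets"): "Data Disclosure",
    ("integrity:add", "routing messages"): "Fabrication",
    ("integrity:add", "data packets"): "Fabrication",
    ("integrity:add", "routing table entries"): "Add Route",
    ("integrity:delete", "routing messages"): "Interruption",
    ("integrity:delete", "data packets"): "Interruption",
    ("integrity:delete", "routing table entries"): "Delete Route",
    ("integrity:change", "routing messages"): "Modification",
    ("integrity:change", "data packets"): "Modification",
    ("integrity:change", "routing table entries"): "Change Route Cost",
    ("integrity:rushing", "routing messages"): "Rushing",
    ("availability", "routing messages"): "Flooding",
    ("availability", "data packets"): "Flooding",
    ("availability", "routing table entries"): "Routing Table Overflow",
}


def anomalous_basic_event(compromise, target):
    """Return e.g. 'Fabrication of Routing Messages', or None if N/A."""
    name = TAXONOMY.get((compromise, target))
    if name is None:
        return None
    return f"{name} of {target.title()}"


event = anomalous_basic_event("integrity:add", "routing messages")
```

A missing key corresponds to an N/A cell, such as confidentiality of routing table entries.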

Conceptually, we can characterize a normal basic event in a similar way, i.e.,
by its target and its operation type. Nevertheless, many different normal
operations can be applied and it is hard to find a universal taxonomy of normal
operations for all system specifications. Thus, a more logical way is to
represent normal basic events with a different structure, such as the extended
state machine approach we introduce in Section 3.

In MANET routing security, cryptography addresses many problems,
especially those involving confidentiality and integrity issues on data packets.
Intrusion detection techniques are more suitable for other security
requirements. Availability issues, for example, are difficult for protection
techniques to address because attack packets appear indistinguishable from
normal user packets. Some integrity problems also require non-cryptographic
solutions for efficiency reasons. For example, an attacker can compromise the
routing table in a local node and change the cost of any specific route entry.
It may change the sequence number or a hop count so that some specific route
appears more attractive than other valid routes. Encrypting every access
operation on routing entries could be too expensive. Intrusion detection
solutions can better address these issues, based on existing experience in
wired networks. We identify a number of anomalous basic events that are more
suitable for intrusion detection systems in bold face in Table 1.

There are two types of anomalous basic events marked by asterisks in the 
table, Fabrication of Routing Messages and Modification of Routing Messages. 
There are cryptographic solutions for these types of problems, but they are not 
very efficient and sometimes require an expensive key establishment phase. We 
want to study them in our IDS work because they are related to the routing 
logic and we can see later that some attacks in these categories can be detected 
easily. 



Table 1. Taxonomy of Anomalous Basic Events

Compromises to            Events by Targets
Security Goals            Routing Messages      Data Packets       Routing Table Entries

Confidentiality           Location Disclosure   Data Disclosure    N/A
Integrity      Add        Fabrication*          Fabrication        Add Route
               Delete     Interruption          Interruption       Delete Route
               Change     Modification*         Modification       Change Route Cost
                          Rushing
Availability              Flooding              Flooding           Routing Table Overflow


We examine a number of basic MANET routing attacks noted in the
literature [HFLY02,NS03,TBK+03]. By comparing them (shown in Table 2) with the
taxonomy in Table 1, we find they match very well with the definitions of
anomalous basic events. We refer to each attack with a unique name and
optionally a suffix letter. For example, “Route Flooding (S)” is a flooding
attack of routing messages that uses a unique source address.

In addition, we consider a number of more complex attack scenarios that
contain a sequence of anomalous basic events. We use some examples studied
by Ning and Sun [NS03]. These attack scenarios are summarized in Table 3.

As a case study, we analyze AODV [PBRD03], a popular MANET routing 
protocol. We analyze its designed behavior using an extended finite state automa- 
ton approach. This is inspired by the work on TCP/IP protocols in [SGF+02]. 

3 A Specification of the AODV Protocol 

3.1 An Overview of Extended Finite State Automaton (EFSA) 

A specification-based approach provides a model to analyze attacks based on
protocol specifications. Similar to the work by Sekar et al. for TCP/IP
protocols [SGF+02], we propose to model the AODV protocol with an EFSA
approach.

An extended finite state automaton (EFSA) is similar to a finite-state ma- 
chine except that transitions and states can carry a finite set of parameters. 

Table 2. Basic MANET Attacks, where suffix letters stand for different attack
variations. R, S, D stand for randomness, source only and destination only,
respectively. Other letters include M (maximal value), F (failure), Y (reply),
I (invalid) and N (new)

Attacks                     Attack Description                                  Corresponding Anomalous Basic Events

Active Reply                A Route Reply is actively forged with no related    Fabrication of Routing Messages
                            incoming Route Request messages.
False Reply                 A Route Reply is forged for a Route Request
                            message even though the node is not supposed to
                            reply.

Route Drop (R)              Drop routing packets. (R) denotes a random          Interruption of Routing Messages
                            selection of source and destination addresses.
Route Drop (S)              A fixed percentage of routing packets with a
                            specific source address are dropped. (S) stands
                            for source address.
Route Drop (D)              A fixed percentage of routing packets with a
                            specific destination address are dropped. (D)
                            stands for destination address.

Modify Sequence (R)         Modify the destination’s sequence number            Modification of Routing Messages
                            randomly. (R) stands for randomness.
Modify Sequence (M)         Increase the destination’s sequence number to
                            the largest allowed number. (M) stands for the
                            maximal value.
Modify Hop                  Change the hop count to a smaller value.

Rushing (F)                 Shorten the waiting time for Route Replies when     Rushing of Routing Messages
                            a route is unavailable. (F) stands for failure.
Rushing (Y)                 Shorten the waiting time to send a Route Reply
                            after a Route Request is received. (Y) stands
                            for reply.

Route Flooding (R)          Flood with both source and destination addresses    Flooding of Routing Messages
                            randomized.
Route Flooding (S)          Flood with the same source address and random
                            destination addresses.
Route Flooding (D)          Flood to a single destination with random
                            source addresses.

Data Drop (R | S | D)       Similar to Route Drop (R), Route Drop (S), or       Interruption of Data Packets
                            Route Drop (D), but using data packets.

Data Flooding (R | S | D)   Similar to Route Flooding (R), Route Flooding       Flooding of Data Packets
                            (S), or Route Flooding (D), but using data
                            packets.

Add Route (I)               An invalid route entry is randomly selected and     Add Route of Routing Table Entries
                            validated. (I) stands for invalid.
Add Route (N)               A route entry is added directly with random
                            destination address. (N) stands for new.

Delete Route                A random valid route is invalidated.                Delete Route of Routing Table Entries

Change Sequence (R | M)     Similar to Modify Sequence attacks but the          Change Route Cost of Routing Table Entries
                            sequence number is changed directly on the
                            routing table.
Change Hop                  Similar to Modify Hop, but the hop count is
                            changed directly on the routing table.

Overflow Table              Add excessive routes to overflow the routing        Routing Table Overflow of Routing Table Entries
                            table.


Conventionally, we call them transition parameters and state variables. We can
derive an EFSA from documentation, implementations, RFCs or other materials.

Furthermore, we distinguish two types of transitions: input and output
transitions. Input transitions include packet-receiving events and output
transitions include packet-delivery events. If there are no packet communication
events involved in a transition (which can take place with a timeout, for
example), it is also treated as an input transition.

Table 3. More Complex MANET Attacks

Attacks         Attack Description                        Corresponding Anomalous Basic Events

Route Invasion  Inject a node in an active route.         Fabrication of Routing Messages (two RREQs)
Route Loop      Create a route loop.                      Fabrication of Routing Messages (two RREPs)
Partition       Separate a network into two partitions.   Fabrication of Routing Messages (RREP)
                                                          Interruption of Data Packets

According to the original definition in [SGF+02], input and output
transitions are separate transitions because only one event can be specified in
a transition. Here we relax the definition of a transition by allowing a
transition to have both a packet-receiving event and a packet-delivery event
(either of them can still be optional). The relaxed definition of a transition
δ is: δ = {S_old → S_new, input_cond → output_action}, where the old and new
states are specified in S_old and S_new. The new definition assumes the
following semantics. The output action, if defined, must be performed
immediately after the input condition is met, and before the new state is
reached. Until the output action has completed, no other transitions are
allowed.
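
The relaxed transition can be sketched as a small data structure. This is an illustration under stated assumptions rather than the implementation used in our prototype: the class and function names, the dictionary representation of state variables and events, and the state names "Invalid" and "Valid" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Transition:
    """delta = {S_old -> S_new, input_cond -> output_action}."""
    s_old: str
    s_new: str
    input_cond: Optional[Callable[[dict, dict], bool]] = None
    output_action: Optional[Callable[[dict, dict], None]] = None


def step(state, variables, event, transitions):
    """Take the first enabled transition out of `state`.

    Returns the new state, or None when no transition is enabled.
    """
    for t in transitions:
        if t.s_old != state:
            continue
        if t.input_cond is None or t.input_cond(variables, event):
            # The output action runs after the input condition is met
            # and before the new state is reached.
            if t.output_action is not None:
                t.output_action(variables, event)
            return t.s_new
    return None


# Hypothetical example: a Route Reply turns an invalid route into a valid one.
def got_rrep(variables, event):
    return event.get("type") == "RREP"


def record_hops(variables, event):
    variables["hops"] = event["hops"] + 1


transitions = [Transition("Invalid", "Valid", got_rrep, record_hops)]
variables = {}
new_state = step("Invalid", variables, {"type": "RREP", "hops": 2}, transitions)
```

Since step runs the output action synchronously before returning, no other transition can be taken until the action has completed, which matches the stated semantics in a single-threaded setting.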

An input condition (input_cond) can specify timeouts or predicates and
at most one packet-receiving event. It uses a C-like expression syntax where
operators like &&, ||, etc., can be used. State variables (of the original
state) and transition parameters can be accessed in input conditions. To
distinguish them, state variables start with lower case letters and transition
parameters start with capital letters. Packet-receiving events, predicates and
timeouts can be used as Boolean functions in input conditions. A
packet-receiving event or a predicate has its own parameters, which must match
the provided values unless the value is a dash (-), which specifies that the
corresponding parameter can match any value. An output action (output_action)
can specify state variable modifications, tasks and at most one packet-delivery
event. Predicates and tasks refer to functionalities that we plan to implement
later. An output action is a list of operations, which can be packet-delivery
events, state variable assignments, or tasks. Either input_cond or output_action
can be optional but at least one must be present.
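
The dash convention for event parameters amounts to positional matching with a wildcard. A minimal sketch follows; the function name, the tuple representation of packet fields, and the addresses are hypothetical.

```python
ANY = "-"  # the dash: matches any value in that position


def matches(pattern, packet_fields):
    """True if each pattern entry is the dash or equals the packet field."""
    return len(pattern) == len(packet_fields) and all(
        p == ANY or p == f for p, f in zip(pattern, packet_fields)
    )


# Field order assumed from RREP?[Prev, Src, Dst, Dst_Seq, Hops].
incoming_rrep = ("10.0.0.2", "10.0.0.1", "10.0.0.7", 12, 3)

# "Any Route Reply whose destination is 10.0.0.7":
hit = matches((ANY, ANY, "10.0.0.7", ANY, ANY), incoming_rrep)
miss = matches((ANY, ANY, "10.0.0.9", ANY, ANY), incoming_rrep)
```

A pattern consisting only of dashes accepts every packet of that type, so the event then acts as a pure "packet of this kind arrived" test.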

In addition, a number of auxiliary functions can be used in either input
conditions or output actions. They are actually evaluated by the IDS. We use
auxiliary functions simply to improve readability.

Protocol state machines are in general non-deterministic, as one incoming 
packet can lead to multiple states. We solve non-determinism by introducing 
a set of finite state automata, which start from the same state, but fork into 
different paths when a state can have multiple transitions based on an incoming 
event. For instance, in TCP, every extended finite state automaton corresponds
to the state of a unique connection. In AODV, operations on a particular route 
entry to a single destination can be defined in an extended finite state automaton. 
For example, an incoming Route Reply message can add new routes to both the 
destination node and the previous hop node, thus the EFSAs for both nodes 
need to process this message. In addition, the Route Reply message may also be 
forwarded to the originator, which is conducted by a third EFSA corresponding 
to the originator. 

Clearly, the number of state machines can increase up to the number of 
possible nodes in the system if their lifetime is unbounded. Thus, we should 
remove unnecessary state machines to reduce memory usage. In AODV, a route 
entry is removed after it has been invalidated for a certain period. In other 
words, we can identify a final state from which no further progress could be 
made. Therefore, state machines reaching the final state can be deleted from the 
state machine repository safely. 
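
The bookkeeping described above can be sketched as a repository keyed by observed node. This is only an illustration: the class names, the state names "Active" and "Final", and the event type DELETE_PERIOD_ELAPSED are hypothetical, and the per-instance logic is reduced to recording events.

```python
class EfsaInstance:
    """Stand-in for one EFSA; the real transition logic is omitted."""

    def __init__(self, ob):
        self.ob = ob              # the observed node this instance tracks
        self.state = "Active"     # hypothetical state name
        self.events = []

    def process(self, event):
        self.events.append(event)
        # Hypothetical event: the invalidated route has aged out.
        if event["type"] == "DELETE_PERIOD_ELAPSED":
            self.state = "Final"


class EfsaRepository:
    def __init__(self):
        self.instances = {}

    def dispatch(self, event, observed_nodes):
        """Hand one event to the EFSA of every node it concerns."""
        for ob in observed_nodes:
            inst = self.instances.get(ob)
            if inst is None:
                inst = self.instances[ob] = EfsaInstance(ob)
            inst.process(event)
            if inst.state == "Final":
                # No further progress is possible: reclaim the memory.
                del self.instances[ob]


repo = EfsaRepository()
rrep = {"type": "RREP", "dst": "D", "prev": "P", "src": "S"}
# One Route Reply is processed by the EFSAs of the destination, the
# previous hop and the originator.
repo.dispatch(rrep, [rrep["dst"], rrep["prev"], rrep["src"]])
repo.dispatch({"type": "DELETE_PERIOD_ELAPSED"}, ["P"])
```

After the second call only the instances for D and S remain, so the repository size tracks the number of live route entries rather than the number of nodes ever seen.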

We construct an AODV EFSA by following the AODV Internet draft version
13 [PBRD03]. AODV uses hop-by-hop routing similar to distance-vector-based
protocols such as RIP [Mal94]. Nevertheless, there are no periodic route
advertisements. Instead, a route is created only if it is demanded by data
traffic with no available routes [PBRD03].



3.2 The AODV EFSA Specification 

Our AODV EFSA is based on the AODV state machine from Bhargavan et al.’s 
work [BGK+02]. It is shown in Figures 1 and 2.

Each EFSA contains two sub graphs. The second sub graph (Figure 2) is 
only in use within a certain period after a node has rebooted. After all other 
nodes have updated their routing entries accordingly, normal routing operations 
resume and the other graph (the normal sub graph, Figure 1) is used. The two
sub graphs are shown separately for a better layout. 

Note that we only capture major AODV functionalities in the EFSA. Some 
specified protocol behavior relies on information from other layers, which we 
cannot model for now. 

The routing behavior in AODV is defined for every single route entry or
destination. In other words, there is a unique EFSA for each destination host.
We use the abbreviation ob to specify the destination, which stands for an
observed node. We define EFSA(ob) as the corresponding EFSA of ob. In addition,
there is a special EFSA, EFSA(cur), where cur is a global variable that defines
the node’s IP address. We create this special EFSA specifically to reply to
Route Requests for the current node. Thus, for each node, we have a total of
n + 1 EFSAs, where n is the number of entries in the node’s routing table. That
is, n instances of EFSA(ob), one for each destination, and one instance of
EFSA(cur).

Timeouts, predicates, packet-receiving events, packet-delivery events, tasks 
and auxiliary functions are further explained below. Note that a predicate or a 
packet-receiving event ends with ‘?’, while a packet-delivery event ends with ‘!’. 

Fig. 1. AODV Extended Finite State machine (ob): In Normal Use

Fig. 2. AODV Extended Finite State machine (ob): After Reboot

Timeouts:

1. DELETE_PERIOD: specify how long an invalidated route should re- 
main in memory. 

2. ACTIVE_ROUTE_TIMEOUT: specify how long before a valid route 
should be invalidated due to inactivity. 

3. NET_TRAVERSAL_TIME: specify the maximal round trip time af- 
ter a RREQ has been sent and before the corresponding RREP is re- 
ceived. 

Predicates (the expected behavior): 

1. noduplicate?[Src, ID]: return true if an RREQ from Src whose RREQ ID is
ID has not been seen before. The pair is then cached and can be used for
comparison in later calls.

2. route_invalidated?[Dst]: return true if a route to Dst has been
invalidated due to link loss, an incoming RERR, etc.

Packet-receiving events: 

1. DATA?[Src, Dst]: return true if there is an incoming data packet that
originated from Src and is destined to Dst.

2. RREQ?[Prev, Src, Src_Seq, Dst, Dst_Seq, Hops, ID]: return true
if a RREQ message has been received and it contains the following fields.
The originator is Src with sequence number Src_Seq and a unique RREQ
ID. The destination is Dst with sequence number Dst_Seq. The number
of hops from Src is Hops. Finally, the Prev field specifies the address of
the previous hop. Although not shown in the outgoing RREQ! event,
this field can be found in the incoming packet’s IP header.

3. RREP?[Prev, Src, Dst, Dst_Seq, Hops]: return true if there is an
incoming RREP and the named fields match the specified parameters
(similar to RREQ?, except that Hops here represents the hops to Dst).

4. RERR?[Src, Dst, Dst_Seq]: return true if an incoming RERR
message was sent by Src, and includes Dst in its unreachable destination list,
with sequence number Dst_Seq.

Packet-delivery events: 

1. DATA![Src, Dst, Next]: forward a data packet that originated
from Src and is destined to Dst, to the next hop Next.

2. RREQ![Src, Src_Seq, Dst, Dst_Seq, Hops, ID]: broadcast RREQ
with supplied fields.

3. RREP![Next, Src, Dst, Dst_Seq, Hops]: deliver RREP. We explicitly
specify Next here since RREP, unlike RREQ, is not broadcast.

4. RERR![Dsts]: deliver RERR with the list of unreachable destinations
in Dsts. Corresponding sequence numbers of these destinations are also
included in Dsts.

Tasks (the expected behavior): 

1. save_buffer(Dst, DATA): buffer the data with destination Dst. 

2. flush_buffer(Dst, Next): deliver all packets in the data buffer with
destination Dst through Next, and remove them from the data buffer.

3. clear_buffer(Dst): remove all data with destination Dst from the data
buffer.

Auxiliary functions:

1. packet?(Dst): return true if there is an incoming packet destined to Dst.
It is a shorthand for DATA?[-, Dst] || RREQ?[-,-,-,Dst,-,-,-] ||
RREP?[-,-,Dst,-,-] || (RERR?[Dsts] && Dst ∈ Dsts).

2. better?([seq1, hop1], [seq2, hop2]): return true if (seq1 > seq2 ||
seq1 == seq2 && hop1 < hop2 || seq2 is unknown).

3. extend(Dst): return a list of unreachable destinations (with their se- 
quence numbers) due to a broken link to Dst. Routes to these destinations 
include Dst as their next hop. Obviously, Dst ∈ extend(Dst).

4. continue: do not stop in the new state after a transition. Instead, at- 
tempt to make another state transition from the new state. 
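
As an illustration of the better? comparison, the following Python sketch follows the definition above literally. The function name and the use of None for an unknown sequence number are assumptions made for this example. It also ignores sequence number rollover, which the AODV specification handles with signed 32-bit arithmetic.

```python
def better(seq1, hop1, seq2, hop2):
    """Return True if route (seq1, hop1) is preferable to (seq2, hop2).

    seq2 is None when the sequence number of the second route is unknown.
    """
    if seq2 is None:
        # An unknown sequence number always loses.
        return True
    # Fresher sequence number wins; ties are broken by the shorter path.
    return seq1 > seq2 or (seq1 == seq2 and hop1 < hop2)
```

Checking the unknown case first avoids comparing an integer against None, which the original one-line condition leaves implicit.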

4 Design of an Intrusion Detection System for AODV

Before we analyze design issues of an Intrusion Detection System (IDS) for 
AODV, we make the following assumptions: 1) IDS should have access to in- 
ternal routing elements, such as routing table entries. Currently, we modify the 
AODV implementation in our testbed to store routing table entries in a shared 
memory block, so that other processes can access them. In the future, hardware 
assistance may be necessary to achieve this; 2) IDS should also have the capa- 
bility of intercepting incoming and outgoing packets, including data and routing 
messages. 

Statistical-based detection techniques, equipped with machine learning tools,
can be used to detect abnormal patterns. They have the potential advantage of
detecting unknown attacks, but they usually come with a high false alarm rate,
and their detection performance depends heavily on the selected features.

In contrast, specification-based techniques use specifications to model legitimate
system behavior and do not produce false alarms. However, developing
specifications is time consuming. Furthermore, many complex attacks do not
violate the specification directly and cannot be detected using this approach.

Our detection approach combines the advantages of both techniques. Conse- 
quently, we separate anomalous basic events into two sets, events that directly 
violate the semantics of EFSAs, and events that require statistical measures. 



4.1 Detection of Specification Violations 

Some anomalous basic events can be directly translated into violations of EF- 
SAs. We identify three types of violations: Invalid State Violation, Incorrect 
Transition Violation and Unexpected Action Violation. 

Invalid State Violation involves a state that does not appear to be valid in 
the specification. In our specification, an invalid state means the combination of 
state variables in the current state is invalid according to the specification. For 
example, a state with a negative hop count is considered an invalid state. In our 
implementation, we keep a copy of state variables every five seconds. Thus, we
can track invalid changes in state variables. 

Incorrect Transition Violations occur if invalid transitions are detected. We 
verify the proper transition by comparing possible input conditions on all tran- 
sitions from the current state. If a state change occurs while no input conditions 
can be met, this type of violation is detected. In addition, there are self-looping 
transitions that do not change the current state. For these transitions, we exam- 
ine output actions. If some of these output actions (which include packet delivery 
events and state variable modifications) are detected while corresponding input 
conditions do not match, we also identify this type of violation. Our implemen- 
tation monitors incoming and outgoing traffic to determine if input conditions 
and output actions are properly handled. 

Unexpected Action Violation corresponds to the situation when the input 
condition during a transition matches and the new state is as expected, but the 
output action is not correctly or fully performed. 
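
The three violation types can be sketched as a small classifier over one observed step of an automaton. This is a simplified Python illustration, not our implementation: the Transition record, the classify function and the event names are hypothetical, the only state variable checked is the hop count, and T10 is assumed to be a self-loop on state Valid that forwards data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    name: str
    src: str
    dst: str
    input_event: str      # enabling input condition, e.g. "DATA?"
    outputs: frozenset    # expected output actions, e.g. {"DATA!"}


def classify(transitions, old_state, new_state, hops, seen_inputs, seen_outputs):
    """Return the violation type for one observed step, or None if it is legal."""
    if hops < 0:
        return "Invalid State Violation"
    candidates = [t for t in transitions
                  if t.src == old_state and t.dst == new_state]
    matched = [t for t in candidates if t.input_event in seen_inputs]
    if not matched:
        # A state change, or output on a self-loop, with no enabling input.
        if old_state != new_state or seen_outputs:
            return "Incorrect Transition Violation"
        return None  # nothing happened in this step
    if any(t.outputs <= seen_outputs for t in matched):
        return None  # input matched and all expected outputs were performed
    return "Unexpected Action Violation"


# Assumed shape of T10: forward data while a valid route exists.
T10 = Transition("T10", "Valid", "Valid", "DATA?", frozenset({"DATA!"}))
dropped = classify([T10], "Valid", "Valid", 2, {"DATA?"}, set())
```

In this sketch a node that receives a data packet but never forwards it is reported as an Unexpected Action Violation, while a forwarded packet with no matching input is reported as an Incorrect Transition Violation.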

We show that the specification-based approach can detect the following 
anomalous basic events: 

Interruption of Data Packets: We monitor the transition T10, where data is
forwarded when a valid route is available. An attacker interrupts data packets
by receiving but not forwarding data. This is observed as an Unexpected
Action Violation in the transition.

Interruption of Routing Messages: An attacker may choose to interrupt certain
types of routing messages by conducting the corresponding transition but
not actually sending the routing packets. In more detail, Route Request
messages are delivered in transition T4, T5, or T5’; Route Reply messages
are delivered in T9 or T11; Route Error messages are delivered in TR2,
TR5, T6, T12 or T12’. They can always be identified as Unexpected Action
Violations in the corresponding transition.

Add Route of Routing Table Entries: We monitor state change to the state when 
a route to ob becomes available (state Valid) from other states. If it does 
not go through legitimate transitions (which include T7, T8, T7’ and T8’), 
it implies that a new route is created bypassing the normal route creation 
path. It is an Incorrect Transition Violation in these transitions. 

Delete Route of Routing Table Entries: Similarly, we monitor state change in a 
reverse direction, i.e., from a valid state (state Valid) to a state when a route 
becomes unavailable (state Invalid). If it does not go through legitimate 
transitions (T12 and T12’), it is detected as an Incorrect Transition Violation 
of these transitions. 

Change Route Cost of Routing Table Entries: We can identify changes in sequence 
numbers or hop counts to the routing table using the memorized state vari- 
able copy, when a valid route is available (state Valid). They are examples 
of Invalid State Violations. 

Fabrication of Routing Messages: Currently, our approach can identify a special 
type of Fabrication of Routing Messages, namely. Route Reply Fabrication. 
We examine the transitions that deliver Route Reply messages (transitions 
T9 and T11). If the output actions are found but the input conditions do not
match, we will identify an Incorrect Transition Violation in these transitions, 
which is an indication that outgoing routing messages are in fact fabricated. 

To summarize, we define a violation detection matrix. It maps violation infor- 
mation (the violated transition(s) or state and the violation type) to an anoma- 
lous basic event. The matrix is shown in Table 4. It can be used to detect attacks 
that directly violate the AODV specification where we can identify the corre- 
sponding types of anomalous basic events. Detection results are summarized in 
Section 4.3. 



Table 4. Violation Detection Matrix in AODV

State or           Invalid State      Incorrect Transition          Unexpected Action
Transition(s)      Violation          Violation                     Violation
-----------------  -----------------  ----------------------------  ------------------------------
TR2, TR5, T6       -                  -                             Interruption of Route Errors
T4, T5, T5’        -                  -                             Interruption of Route Requests
T7, T8, T7’, T8’   -                  Add Route                     -
T9, T11            -                  Fabrication of Route Replies  Interruption of Route Replies
T10                -                  -                             Interruption of Data Packets
T12, T12’          -                  Delete Route                  Interruption of Route Errors
Valid              Change Route Cost  -                             -
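
The violation detection matrix of Table 4 is in effect a lookup from (state or transition, violation type) to an anomalous basic event. A minimal Python sketch of that lookup follows. The constant and function names are illustrative, and primes are written with an ASCII apostrophe.

```python
INVALID_STATE = "Invalid State Violation"
INCORRECT_TRANSITION = "Incorrect Transition Violation"
UNEXPECTED_ACTION = "Unexpected Action Violation"

# One entry per non-empty cell of Table 4: (places, violation type, event).
_ROWS = [
    (("TR2", "TR5", "T6"), UNEXPECTED_ACTION, "Interruption of Route Errors"),
    (("T4", "T5", "T5'"), UNEXPECTED_ACTION, "Interruption of Route Requests"),
    (("T7", "T8", "T7'", "T8'"), INCORRECT_TRANSITION, "Add Route"),
    (("T9", "T11"), INCORRECT_TRANSITION, "Fabrication of Route Replies"),
    (("T9", "T11"), UNEXPECTED_ACTION, "Interruption of Route Replies"),
    (("T10",), UNEXPECTED_ACTION, "Interruption of Data Packets"),
    (("T12", "T12'"), INCORRECT_TRANSITION, "Delete Route"),
    (("T12", "T12'"), UNEXPECTED_ACTION, "Interruption of Route Errors"),
    (("Valid",), INVALID_STATE, "Change Route Cost"),
]

VIOLATION_MATRIX = {
    (where, violation): event
    for places, violation, event in _ROWS
    for where in places
}


def basic_event(where, violation):
    """Map a violated state/transition and violation type to a basic event."""
    return VIOLATION_MATRIX.get((where, violation))
```

A combination that does not appear in the table, such as an Invalid State Violation reported for T10, maps to None in this sketch.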







4.2 Detection of Statistical Deviations 

For anomalous events that are temporal and statistical in nature, statistical 
features can be constructed and applied to build a machine learning model that 
distinguishes normal and anomalous events. RIPPER [Coh95], a well-known rule 
based classifier, is used in our experiments. 

We first determine a set of statistical features based on activities from anoma- 
lous basic events that cannot be effectively detected using the specification-based 
approach. Features are computed periodically based on the specified statistics 
from all running EFSAs, and stored in audit logs for further inspection. To build 
a detection model, we use a number of off-line audit logs (known as training data) 
which contain attacks matching these anomalous basic events. Furthermore, each 
record is pre-labeled with the type of the corresponding anomalous basic event 
(or normal if the record is not associated with any attacks) because we know 
which attacks are used. They are processed by RIPPER and a detection model 
is generated. The model is a set of detection rules. The model is then used to 
detect attacks in the test data. 

Using the taxonomy of anomalous basic events in Table 1, we identify the 
following anomalous basic events that remain to be addressed, because they can- 
not be detected in the specification-based approach. For each type of anomalous 
basic event, we discuss what features are needed to capture its behavior. All
features are defined within a sampling window. We use a sampling window of
five seconds in all cases. In addition, features are normalized to a scale of 0 to
50.
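
To make the windowing concrete, the following Python sketch counts events per five-second window and maps a raw count onto the 0 to 50 scale. The normalization method is not specified in the text; clamped linear scaling against a hypothetical max_count is an assumption made only for illustration, as are the function names.

```python
from collections import Counter

WINDOW_SECONDS = 5   # sampling window from the text
SCALE_MAX = 50       # upper end of the normalized range from the text


def window_counts(timestamps):
    """Count events per five-second window; keys are window indices."""
    return Counter(int(t // WINDOW_SECONDS) for t in timestamps)


def normalize(count, max_count):
    """Map a raw count onto 0..50, clamping at max_count (assumed method)."""
    if max_count <= 0:
        raise ValueError("max_count must be positive")
    return min(count, max_count) * SCALE_MAX / max_count


counts = window_counts([0.5, 1.0, 4.9, 5.0, 12.3])
```

With this windowing, a run of 100,000 seconds yields 100,000 / 5 = 20,000 records, which matches the record counts used in the experiments.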

Flooding of Data Packets: In order to capture this anomalous event, we need to
capture the volume of incoming data packets. In AODV, data packets can
be accepted under three different situations: when a valid route is available
(which is transition T10), when a route is unavailable and no route request
has been sent yet (transition T2) or when a route is unavailable and a route
request has been sent to solicit a route for the destination (transition T3).
Accordingly, we should monitor frequencies of all these data packet receiving
transitions. We define three statistical features, Data1, Data2, and Data3,
one for each transition (T10, T2 and T3) respectively.

Flooding of Routing Messages: Similarly, we need to monitor the frequencies of
transitions where routing messages are received. However, a larger set of
transitions needs to be observed because we need to take into account
every type of routing message (which includes 15 transitions: T5, T5’, T7,
T8, T7’, T8’, T7”, T8”, T9, T11, TR3, TR4, TR3’, TR4’, and TR6). In order
not to introduce too many features, we use an aggregated feature Routing
which denotes the frequency of all these transitions. Note that it is not the
same as monitoring the rate of incoming routing messages. An incoming
routing message may not be processed by any EFSA in a node. We need
only to consider messages that are being processed.

Modification of Routing Messages: Currently, we consider only modifications to 
the sequence number field. We define Seq as the highest destination sequence 
number in routing messages during transitions where they are received (see 
above for the transitions involved in routing messages). 

Rushing of Routing Messages: We monitor two features where some typical
routing process may be rushed. Rushing1 is the frequency of the transition where
a route discovery process fails because the number of Route Requests sent
has exceeded a threshold (RREQ_RETRIES) or a certain timeout has elapsed
(NET_TRAVERSAL_TIME in transition T6). Rushing2 is the frequency of
the transition where a Route Request message was received and it is replied
by delivering a Route Reply message (transition T11).
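
The feature definitions above can be summarized as a mapping from observed transitions to counters. The Python sketch below computes the raw (unnormalized) features for one sampling window. The input format, a list of (transition name, destination sequence number or None) pairs, and the function name are assumptions for this example, and primes are written with ASCII apostrophes.

```python
DATA_FEATURES = {"T10": "Data1", "T2": "Data2", "T3": "Data3"}
ROUTING_TRANSITIONS = frozenset({
    "T5", "T5'", "T7", "T8", "T7'", "T8'", "T7''", "T8''",
    "T9", "T11", "TR3", "TR4", "TR3'", "TR4'", "TR6",
})
RUSHING_FEATURES = {"T6": "Rushing1", "T11": "Rushing2"}


def window_features(events):
    """Compute raw features from the transitions seen in one window."""
    features = dict.fromkeys(
        ["Data1", "Data2", "Data3", "Routing", "Seq", "Rushing1", "Rushing2"], 0)
    for name, dst_seq in events:
        if name in DATA_FEATURES:
            features[DATA_FEATURES[name]] += 1
        if name in ROUTING_TRANSITIONS:
            features["Routing"] += 1
            if dst_seq is not None:
                # Seq tracks the highest destination sequence number seen.
                features["Seq"] = max(features["Seq"], dst_seq)
        if name in RUSHING_FEATURES:
            features[RUSHING_FEATURES[name]] += 1
    return features


sample = window_features([
    ("T10", None), ("T10", None), ("T3", None),
    ("T11", 42), ("T7", 17), ("T6", None),
])
```

Note that T11 contributes to both Routing and Rushing2, since it appears in both definitions.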



4.3 Experiments and Results 

Environment: We use MobiEmu [ZL02] as the evaluation platform. MobiEmu 
is an experimental testbed that emulates MANET environment with a local 
wired network. Mobile topology is emulated through Linux’s packet filtering 
mechanism. Different from many simulation tools, MobiEmu provides a scal- 
able application-level emulation platform, which is critical for us to evaluate 
the intrusion detection framework efficiently on a reasonably large network. We 
use the AODV-UIUC implementation [KZG03], which is designed specifically to 
work with the MobiEmu platform. 

Experiment Parameters: The following parameters are used throughout our experiments.
Mobility scenarios are generated using a random way-point model
with 50 nodes moving in an area of 1000m by 1000m. The pause time between 
movements is 10s and the maximum movement speed is 20.0m/s. Randomized 
TCP and UDP/CBR (Constant Bit Rate) traffic are used but the maximum 
number of connections is set to 20; and the average traffic rate is 4 packets per 
second. These parameters define a typical MANET scenario with modest traffic 
load and mobility. They are similar to the parameters used in other MANET 
experiments, such as [PRDM01,MBJJ99,MGLB00]. Nevertheless, we have not 
systematically explored all possible scenarios (for instance, with high mobility 
or under high traffic load). We plan to address this issue in our future work. 

We test our framework with multiple independent runs. A normal run con- 
tains only normal background traffic. An attack run, in addition, contains mul- 
tiple attack instances which are randomly generated from attacks specified in 
Tables 2 and 3 or a subset according to certain criteria. 

We use ten attack runs and two normal runs as the test data, each of which 
runs 100,000 seconds (or 20,000 records since we use a sampling window of five 
seconds). In each attack run, different types of attacks are generated randomly 
with equal probability. Attack instances are also generated with random time 
lengths, but we guarantee that 80% of total records are normal. It is a relatively 
practical setting considering that normal events should be the majority in a real 
network environment. We use normal data in normal runs and attack runs to 
evaluate false alarm rates. 

Detection of Specification Violations: The following attacks are detected in the 
test data as direct violations of the EFSA, which verifies our previous analysis 
that these attacks match anomalous basic events that can be directly detected 
by verifying the specification. For complex attacks, a different network size may 
be used if appropriate. Note that detection rates are 100% and false alarm rates 
are 0% for attacks when the specification-based approach is used. 

Data Drop (R | S | D): detected as Interruption of Data Packets. 

Route Drop (R | S | D): detected as Interruption of Routing Messages. 

Add Route (I | N): detected as Add Route of Routing Table Entries. 

Delete Route: detected as Delete Route of Routing Table Entries. 

Change Sequence (R | M); Change Hop: detected as Change Route Cost of Rout- 
ing Table Entries. 

Active Reply; False Reply: detected as Route Reply Fabrication. 

Route Invasion; Route Loop: They are detected since they use fabricated routing
messages similar to what the Active Reply attack does. In particular, Route
Invasion uses Route Request messages, and Route Loop uses Route Reply
messages. With the same set of transitions in Route Drop, we can detect 
them as Incorrect Transition Violations in Route Request or Route Reply 
delivery transitions. 

Partition: This attack can be detected since it uses a fabricated routing message
(Route Reply) and interrupts data packets. Therefore, monitoring the transitions
related to Route Reply (as in Route Drop), and the transition related
to data packet forwarding (T10, as described in Data Drop), we can detect
this attack with the following violations identified: Incorrect Transition
Violation in Route Reply delivery transitions and Unexpected Action Violation
in the data forwarding transition.

Detection of Statistical Deviations: Some attacks are temporal and statistical
in nature and should be detected using the statistical approach. The following
are four representative examples of such attacks: Data Flooding (S | D | R);
Route Flooding (S | D | R); Modify sequence (R | M); Rushing (F | Y).

Four attack data sets, each of which contains an attack run of 25,000 seconds
(or 5,000 records), are used to train the detection model. Each data set
contains attacks that match one type of anomalous basic event. Attack instances
are generated in such a way that abnormal records account for roughly 50%
of total records, whereas normal records make up 80% of the test data. Using
approximately the same amount of normal and abnormal data helps improve
detection accuracy. We train separately with each training data set. The same
test data set is used to evaluate the learned model.

Table 5. Detection and False Alarm Rates of the Statistical-based Approach

(a) Attack Detection Rates

Attack               Detection rate
-------------------  --------------
Data Flooding (S)    93±3%
Data Flooding (D)    91±4%
Data Flooding (R)    92±4%
Route Flooding (S)   89±3%
Route Flooding (D)   91±2%
Route Flooding (R)   89±3%
Modify sequence (R)  59±19%
Modify sequence (M)  100%
Rushing (F)          91±3%
Rushing (Y)          85±4%

(b) Detection and False Alarm Rates of Anomalous Basic Events

Anomalous Basic Event             Detection Rate  False Alarm Rate
--------------------------------  --------------  ----------------
Flooding of Data Packets          92±3%           5±1%
Flooding of Routing Messages      91±3%           9±4%
Modification of Routing Messages  79±10%          32±8%
Rushing of Routing Messages       88±4%           14±2%



The detailed detection results are shown in Table 5. We show the detection 
rates of tested attacks (in Table 5(a)). We consider a successful detection of an 
attack record if and only if the corresponding anomalous basic event is correctly 
identified. We also show the detection and false alarm rates (in Table 5(b)) 
directly against anomalous basic events. We analyze these results for each type 
of anomalous basic event below. 
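
The two rates can be computed from labeled records as in the following Python sketch. It assumes the usual definitions, which the text does not spell out: the detection rate is the share of attack records whose anomalous basic event is correctly identified, and the false alarm rate is the share of normal records flagged as any attack. The function name and record format are illustrative, and the sketch aggregates over all events rather than reporting one rate per event as Table 5(b) does.

```python
def rates(records):
    """records: iterable of (true_label, predicted_label) pairs.

    The label "normal" marks a record that is not associated with any attack.
    Returns (detection_rate, false_alarm_rate).
    """
    attacks = detected = normals = false_alarms = 0
    for truth, predicted in records:
        if truth == "normal":
            normals += 1
            false_alarms += predicted != "normal"
        else:
            attacks += 1
            # Only the correct anomalous basic event counts as a detection.
            detected += predicted == truth
    detection_rate = detected / attacks if attacks else 0.0
    false_alarm_rate = false_alarms / normals if normals else 0.0
    return detection_rate, false_alarm_rate


flood = "Flooding of Data Packets"
example = (
    [("normal", "normal")] * 8
    + [("normal", flood)] * 2
    + [(flood, flood)] * 3
    + [(flood, "normal")]
)
detection_rate, false_alarm_rate = rates(example)
```

An attack record classified as a different anomalous basic event counts as a miss here, in line with the "if and only if" criterion stated above.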

Flooding of Data Packets and Routing Messages: We implement flooding as
traffic over 20 packets per second. For flooding of data packets, 92% can be
detected. They are detected by observing abnormally high volume on at
least one of the related statistics, Data1, Data2, or Data3. Similar results are
also observed for flooding of routing messages.

Modification of Routing Messages: The corresponding detection result is not very
satisfactory. It shows high variations in both the detection and false alarm
rates. In fact, the corresponding detection rule assumes that this anomalous
basic event can be predicted when at least some incoming packet has
a sequence number larger than a certain threshold. It is not a rule that can
be generally applied. Randomly generated sequence numbers may only be
partially detected as attacks. We discuss this problem further at the end of this
section. In contrast, for a special type of sequence modification (Modify
Sequence (M)), the detection rate is perfect, because it is very
rare for the largest sequence number to appear in the sequence number field
of routing messages.
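
The limitation of a threshold rule on Seq can be illustrated numerically. The Python sketch below is not the learned RIPPER rule. It assumes 32-bit sequence numbers, a hypothetical threshold, and forged sequence numbers drawn uniformly at random, and shows why the maximum value is always flagged while random values are only partially flagged.

```python
MAX_SEQ = 2**32 - 1   # assumed: AODV sequence numbers are 32-bit


def flags_modification(seq, threshold):
    """Threshold rule: report a modification if Seq exceeds the threshold."""
    return seq > threshold


def detected_fraction(threshold, max_seq=MAX_SEQ):
    """Share of uniformly random values in [0, max_seq] above the threshold."""
    if not 0 <= threshold <= max_seq:
        raise ValueError("threshold out of range")
    return (max_seq - threshold) / (max_seq + 1)


threshold = 2**31      # hypothetical value, for illustration only
half = detected_fraction(threshold)
```

Under these assumptions a threshold in the middle of the range catches about half of the random modifications but every modification that uses the largest sequence number.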

Rushing of Routing Messages: Detection performance varies significantly across
different rushing attacks, namely Rushing (F) and Rushing (Y). In Rushing
(F), the attacker tries to shorten the waiting time for a Route Reply message
even if a route is not available yet. Because more requests to the same
destination may follow if route discovery was prematurely interrupted, the
attack results in an abnormally high frequency of the transition in which the
route discovery process is terminated (Rushing1). In Rushing (Y), the attacker
expedites Route Reply delivery when a Route Request message has been received.
It can be captured because the corresponding transition (T11) now occurs more
frequently than a computed threshold (Rushing2). Nevertheless, we also observe
significant false alarms in detecting these attacks. They result from irregular
route topology changes due to the dynamic nature of MANETs. Some normal
nodes may temporarily suffer a high route request volume that exceeds these
thresholds.

Discussion: Comparing our results with the taxonomy of anomalous basic events
in Table 1, we realize that a few of them cannot be detected effectively yet.
First, we cannot detect Route Message Modification with incoming packets in
which the modification patterns are not known in advance, because doing so
requires knowledge beyond a local node. However, these attacks can usually
be detected using other security mechanisms or by other nodes. If the message
comes from external sources, it may be successfully prevented by a cryptographic
authentication scheme. Otherwise (i.e., it was delivered by the routing agent from
another legitimate node), the IDS agent running on that node may have detected
the attack. In addition, Rushing attacks cannot be detected very effectively,
especially when features beyond the routing protocol, such as delays in the MAC
layer, are involved. Our system could be improved if we were able to extend our
detection architecture across multiple network layers. This is part of our future
work.

5 Related Work 

Many cryptographic schemes have been proposed to secure ad hoc routing protocols.
Zapata [Zap01] proposed a secure AODV protocol using asymmetric cryptography.
Hu et al. [HPJ02] proposed an alternative authentication scheme based
on symmetric keys to secure the DSR protocol [JMB01], because public key
computation appears too expensive for MANET nodes with limited power and 
computation capabilities. As we have demonstrated, protection approaches are 
suitable for a certain class of security problems. Intrusion detection approaches 
may be more suitable to address other problems. 

Vigna and Kemmerer [VK98] proposed a misuse intrusion detection sys- 
tem, NetSTAT, which extends the original state transition analysis technique 
(STAT) [IKP95]. It models an attack as a sequence of states and transitions 
in a finite state machine. In our work, by contrast, finite state machines model
normal events. Specification-based intrusion detection was proposed by
Ko et al. [KRL97] and Sekar et al. [SGF+02]. Specification-based approaches re- 
duce false alarms by using manually developed specifications. Nevertheless, many 
attacks do not directly violate specifications and thus, specification-based ap- 
proaches cannot detect them effectively. In our work, we apply both specification- 
based and statistical-based approaches to provide better detection accuracy and 
performance. 

Bhargavan et al. [BGK+02] analyzed simulations of AODV protocols. Their 
work included a prototype AODV state machine. Our AODV EFSA is based on 
their work but has been heavily extended. Ning and Sun [NS03] also studied the 
AODV protocol and used the definition of atomic misuses, which is similar to 
our definition of basic events. However, our definition is more general because 
we have a systematic study of taxonomy of anomalous basic events in MANET 
routing protocols. 

Recently, Tseng et al. [TBK+03] proposed a different specification-based de- 
tection approach. They assume the availability of a cooperative network monitor 
architecture, which can verify routing request-reply flows and identify many at- 
tacks. Nevertheless, there are security issues as well in the network monitor 
architecture which were not clearly addressed. 



6 Conclusion and Future Work 

We proposed a new systematic approach to categorize attacks. Our approach 
decomposes an attack into a number of basic events. We showed its use in attack 
taxonomy analysis. In addition, protocol specifications can be used to model 
normal protocol behavior and can be used by intrusion detection systems. By 
applying both specification-based and statistical-based detection approaches, we 
have the advantages of both: the specification-based approach produces no false
alarms, while the statistical-based approach can detect attacks that are
statistical or temporal in nature.

We proposed a taxonomy of anomalous basic events in MANET routing pro- 
tocols and presented a case study of the AODV protocol. We constructed an 
AODV extended finite state automaton specification. By examining direct vi- 
olations of the specification, and by constructing statistical features from the 
specification and applying machine learning tools, we showed that most anoma- 
lous basic events were detected in our experiments. 

Future Work: We plan to enhance our framework by automatically extracting
useful features for detection of unknown attacks. We also plan to design an intru- 
sion detection system across multiple network layers to detect more sophisticated 
attacks. 

Abstract. Monitoring unused or dark IP addresses offers opportunities to significantly
improve and expand knowledge of abuse activity without many of the
problems associated with typical network intrusion detection and firewall sys- 
tems. In this paper, we address the problem of designing and deploying a system 
for monitoring large unused address spaces such as class A telescopes with 16M 
IP addresses. We describe the architecture and implementation of the Internet 
Sink (iSink) system which measures packet traffic on unused IP addresses in an 
efficient, extensible and scalable fashion. In contrast to traditional intrusion de- 
tection systems or firewalls, iSink includes an active component that generates 
response packets to incoming traffic. This gives the iSink an important advan- 
tage in discriminating between different types of attacks (through examination 
of the response payloads). The key feature of iSink’s design that distinguishes it 
from other unused address space monitors is that its active response component 
is stateless and thus highly scalable. We report performance results of our iSink 
implementation in both controlled laboratory experiments and from a case study 
of a live deployment. Our results demonstrate the efficiency and scalability of 
our implementation as well as the important perspective on abuse activity that is 
afforded by its use. 

Keywords: Intrusion Detection; Honeypots; Deception Systems 

1 Introduction 

Network abuse in the form of intrusions by port scanning or self-propagating worms is a
significant, on-going threat in the Internet. Clever new scanning methods are constantly
being developed to thwart identification by standard firewalls and network intrusion
detection systems (NIDS). Work by Staniford et al. [27] and by Moore et al. [18] projects
and evaluates the magnitude of the threat of new classes of worms and the difficulty of
containing such worms. The conclusion of both papers is that addressing these threats
presents the research and operational communities with serious challenges. An important
step in protecting networks from malicious intrusions is to improve measurement
and detection capabilities.

One means for improving the perspective and effectiveness of detection tools is to 
monitor both used and unused address space in a given network. Monitoring the un- 
used addresses is not typically done since packets destined for those addresses are often 
dropped by a network’s gateway or border router. However, tracking packets sent to 
unused addresses offers two important advantages. First, other than misconfigurations, 
packets destined to unused addresses are almost always malicious, thus false positives 
- a significant problem in NIDS - are minimized. Second, unlike NIDS that monitor
traffic passively, a detection tool that monitors unused addresses can actively respond 
to connection requests, thus enabling the capture of data packets with attack-specific 
information. The possibility for unused address space monitoring is perhaps most sig- 
nificant in class A and class B networks where the number of unused addresses is often 
substantial. The idea of monitoring unused address space has been adopted in a num- 
ber of different studies and on-going projects including the DOMINO project [31], the 
Honeynet project [29], LaBrea tarpits [14] and in the backscatter analysis conducted by
Moore et al. in [19]. 

This paper makes two contributions. The first is our description of a new system 
architecture and implementation for measuring IP traffic. An Internet Sink or iSink, is 
a system we developed for monitoring abuse traffic by both active and passive means. 
The key design requirements of an iSink are extensibility of features and scalability 
of performance since it is meant to be used to monitor potentially large amounts of IP 
address space. 

Our design of an iSink includes capabilities to trace packets, to actively respond 
to connection requests, to masquerade as several different application types, to finger- 
print source hosts and to sample packets for increased scalability. The passive compo- 
nent of our implementation (which we call Passive Monitor) is based on Argus [3] - a 
freely available IP flow measurement tool. The active component of our implementation 
(which we call Active Sink) is based on the Click modular router platform [12]. Click is 
an open-source toolkit for building high performance network systems on commodity 
hardware. The focus of Active Sink’s development was to build a set of stateless respon- 
der elements which generate the appropriate series of application level response packets 
for connections that target different network services including HTTP, NetBIOS/SMB 
and DCERPC (Windows RPC Service). 
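
To illustrate what a stateless responder means in practice, the following Python sketch builds a TCP reply using only fields of the incoming segment, so that no per-connection table is needed. It is a simplified illustration and not the Active Sink or Click element code: the Segment record, the function names, and the choice of a CRC32 over the address and port tuple as the initial sequence number are all assumptions made for this example.

```python
import zlib
from dataclasses import dataclass

SYN, ACK = 0x02, 0x10   # TCP flag bits


@dataclass(frozen=True)
class Segment:
    src: str
    dst: str
    sport: int
    dport: int
    seq: int
    ack: int
    flags: int
    payload_len: int = 0


def initial_seq(seg):
    """Derive our ISN from the 4-tuple so it never has to be stored."""
    key = f"{seg.dst}:{seg.dport}>{seg.src}:{seg.sport}".encode()
    return zlib.crc32(key)


def respond(seg):
    """Build a reply from the incoming segment alone; None means stay silent."""
    if seg.flags & SYN and not seg.flags & ACK:
        # Answer a connection request with SYN-ACK, acknowledging seq + 1.
        return Segment(seg.dst, seg.src, seg.dport, seg.sport,
                       initial_seq(seg), (seg.seq + 1) % 2**32, SYN | ACK)
    if seg.flags & ACK and seg.payload_len > 0:
        # Acknowledge data; our own seq is whatever the peer last acknowledged.
        return Segment(seg.dst, seg.src, seg.dport, seg.sport,
                       seg.ack, (seg.seq + seg.payload_len) % 2**32, ACK)
    return None


syn = Segment("192.0.2.7", "10.0.0.5", 4444, 80, 1000, 0, SYN)
syn_ack = respond(syn)
```

Because every field of the reply is a function of the incoming segment, memory use does not grow with the number of monitored addresses or open connections, which is the property the stateless design relies on.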

The second contribution of this paper is a measurement and evaluation case study 
of our iSink implementation. We use the results from the case study to demonstrate the 
scale and diversity of traffic characteristics exposed by iSink-based monitoring. These 
results provide validation of our architectural requirements and rationale for subsequent 
evaluation criteria. We also deployed the iSink in situ to monitor four class B address 
spaces within our campus network for a period of 4 months and one entire class A ad- 
dress space to which we have access. From these data sets we report results that demon- 
strate the iSink’s capabilities and the unique information that can be extracted from this 
measurement tool. One example is that since the traffic characteristics from our class 
B monitor are substantially different from those on the class A monitor, we conclude 
that the location of the iSink in IP address space is important. Another example is that 
we see strong evidence of periodic probing in our class A monitor which we were able 
to isolate to the LovGate worm [2]. We also uncovered an SMTP hot-spot within the 
class A network that has been unreported prior to our study. We were able to attribute 
that anomaly to misconfigured wireless routers from a major vendor. Finally, we assess 
basic performance of the iSink in controlled laboratory experiments and show that our 
implementation has highly scalable response capability. 

These results demonstrate that our iSink architecture is able to support a range of 
capabilities while providing scalable performance. The results also demonstrate that 
Active iSinks are a simple and very useful way to extend basic intrusion monitoring
capabilities in individual networks or in the Internet as a whole. 



2 Related Work 

The notion of monitoring unused IP addresses as a source of information on intrusions 
has been in use in various forms for some time. While we coin the terms “Internet Sink” 
and “iSink”, these monitors have variously been referred to as “Internet Sink-holes” 
[8], “Blackhole Routers” [9] and “Network Telescopes” [15]. Traditional Honeypots 
are defined as systems with no authorized activity that are deployed with the sole purpose
of monitoring intrusions. Honeynets are networks of honeypots (typically set up as
VMware hosts). Their deployment is often associated with significant management and 
scalability challenges [29]. In [15], Moore raises the challenges of deploying honeypots 
in a class A network telescope. The systems that are perhaps most similar to the Active
Sink have been developed in the Honeyd [10] and LaBrea Tarpit [14] projects. Active
Sink’s design differs in significant ways from these two systems. Much like the Active
Sink, Honeyd is designed to simulate virtual honeypots over unused IP addresses,
with the potential for a diverse set of interactive response capabilities. However, Honeyd’s
stateful active responder design has significant scalability constraints that make it
inappropriate for monitoring large IP address ranges, which is one of iSink’s primary
objectives. LaBrea’s primary design objective is to slow the propagation of Internet worms
(i.e., a sticky honeypot), and as such, it lacks the richness of interaction capabilities that
is required to gather important response information. In addition to a richer response
set, our Active Sink’s performance greatly exceeds that of LaBrea, as will be seen in
Section 5.

There are a number of empirical studies of intrusion and attack activity that moti- 
vate and inform our work. In [33], the authors explore the statistical characteristics of 
Internet intrusion activity from a global perspective. That study is based on the use of 
intrusion logs from NIDS and firewalls located broadly across the Internet. Moore et al. 
examined the global prevalence of denial-of-service attacks using backscatter analysis
in [19]. That work was conducted by gathering packet traces from a relatively quies- 
cent class A network. Characteristics of the Code Red worm have been analyzed in a 
number of studies. In [17] the authors investigate the details of the Code Red outbreak 
and provide important perspective on the speed of worm propagation. Moore et al. provide
further insights on the speed at which countermeasures would have to be installed
to inhibit worm propagation [18]. While the prospects for successful containment are
rather grim, it is clear that rapid detection will be a key component in any quarantine 
strategy. 

Intrusion detection systems are a standard component in network security architec- 
tures. These tools typically monitor packet traffic at network ingress/egress points and 
identify potential intrusions using a variety of techniques. Standard methods for intrusion
identification include misuse detection (e.g., [21,25]), statistical anomaly detection
(e.g., [26]), information retrieval (e.g., [1]), data mining (e.g., [13]), and inductive learning
(e.g., [28]). Our work is distinguished from general NIDS in that they operate on active
IP addresses and must deal with the problem of identifying the nefarious traffic mixed
in with all of the legitimate traffic. We expect iSinks and NIDS to complement each
other in future operational environments.

High performance packet monitors have been used for collecting packet traces in 
the Internet for years. These systems relate directly to our iSink design in that they 
must scale to reliably log packets on very high speed links. Examples of these include 
systems that have been developed with a variety of commodity and special purpose 
hardware such as [4,7, 11]. Our iSink differs significantly from these systems (as well 
as the NIDS mentioned above) in that it not only passively monitors and logs packets, 
but it also actively responds to incoming TCP connection requests and has application 
level response capability. 

3 Internet Sink Architecture 

In this section we describe the iSink requirements, architecture and implementation. 
The implementation is described within the context of deployments on two different 
sets of address spaces. 

3.1 Design Requirements 

The general requirements for an iSink system are that it possess scalable capability 
for both passive and active monitoring and that it be secure. We discuss the issues of 
security in more detail in [32]. 

Passive monitoring capability must be able to capture packet header and payload 
information accurately. While there are many standard tools and methods for packet
capture, if either these or new tools are employed, they should be flexible and efficient 
in the ways in which data is captured and logged. 

Active response capability is included in iSink’s design as a means to gather more 
detailed information on abuse activity. This capability is enabled by generating appro- 
priate response packets (at both transport and application levels) to a wide range of 
intrusion traffic. While active responses also have the potential to interfere with mali- 
cious traffic in beneficial ways such as tarpitting, this is not a focus of iSink’s design. 

We expect Internet Sinks to measure abuse activity over potentially vast unused 
IP address spaces. For example, in our experimental setup, we needed the ability to 
scale to an entire class A network (16 million addresses). With the continued growth 
in malicious Internet traffic, and transition to IPv6, we expect the scalability needs to 
grow significantly for both the active and passive components of our system. Our basic 
approach to scalability is to maintain as little state as possible in our active responders. 
Another means for increasing scalability is through the use of sampling techniques in 
both active and passive components of the system. If sampling is employed, then the 
measurement results must not be substantially altered through their use. 

Finally, our intent is to develop iSink as an open platform, thus any systems that are 
used as foundational components must be open source. 

3.2 Active Response: The Design Space 

In this section we explore the architectural alternatives for sink-hole response systems. 
The choices we consider are LaBrea, Honeyd, Honeynets and Active Sink (iSink’s active
response system) as shown in Table 1. We compare these systems based on the
following characteristics.

Table 1. Design Space of Sink-Hole Responders

              Configurability  Modularity   Flexibility  Interactivity  Scalability
Active Sink   High             High         High         Low-Medium     High
Honeyd        High             Low-Medium   High         Low-Medium     Low-Medium
Honeynet      Low              Medium       Medium       High           Low-Medium
LaBrea        Low              Low          Low          Limited        High

1. Configurability describes the ability of the configuration language to define the
layout and components of response networks. Honeyd’s strengths are in fine-grained
control of virtual network topologies and network protocol stacks. However,
Honeyd’s language does not provide support for assigning large blocks of IP
addresses to templates (except for the default template)¹. Active Sink’s configuration
language (inherited from Click) uses a BPF-like language and provides excellent
support for both fine-grained and coarse-grained control of a virtual network
topology. Active Sink’s design is stateless and hence does not replicate network
stack retransmission timers. LaBrea and Honeynets only allow for limited configurability.

2. Flexibility relates to the ability to mix and match services with operating systems. 
For example, the ability to define two types of Windows Servers: one with a telnet 
service and FTP service and another with NetBIOS Service and a Web server. The 
design of Honeyd and Active Sink both provide a high degree of flexibility. It is 
somewhat harder to do the same with Honeynets. LaBrea’s flexibility in this regard 
is limited as it was designed with a different objective. 

3. Modularity describes the ability to compose and layer services on top of one an- 
other. For example, layering Server Message Block (SMB) service over NetBIOS 
or layering Web services over SSL. Active Sink’s design is inherently modular 
which directly facilitates service composition. In contrast, Honeyd’s design is more 
monolithic and hence less straightforward to layer services. 

4. Interactivity refers to the scope of response capability. The levels of interactivity
of Honeyd and Active Sink are comparable. Obviously, Honeynets could provide
more complete response capabilities. However, to mitigate the risk of Honeynets
being used as a stepping-stone for additional attacks, data controls that limit
interactivity must be put in place. There are other practical configuration issues
that also could limit interactivity. For example, Active Sink’s NetBIOS responder
grants session requests for all NetBIOS names and all user/password combinations,
while a Honeynet Windows monitor would only allow a NetBIOS session request if
the requested name matches its list of valid names. Hence, the realized degree of
interaction in Active Sinks is often higher than in Honeynets.

5. Scalability refers to the number of connections that can be handled in a given time
period. In our monitoring environment we typically see hundreds of thousands of
connection attempts per minute. Active Sink’s stateless kernel module design provides
a high degree of scalability by eliminating unnecessary system calls and interrupt
handling overheads². LaBrea’s stateless design also provides reasonable scaling
properties; however, its user-level implementation makes it inferior to the Active
Sink. A weakness of Honeyd’s design is its inherent statefulness, which limits its
scalability³. Our experience suggests that Honeyd works well in environments that see
tens of connection attempts per minute. The scalability of Honeynet systems varies
from low to medium depending on the service and licensing issues.

¹ This feature is particularly necessary for large network sinks.

3.3 Implementation 

The objective of our monitoring infrastructure implementation was to create a highly 
scalable backplane with sufficient interactivity to filter out known worms, attacks and 
misconfiguration. To accomplish this, the iSink design includes a Passive Monitor, an 
Active Sink and a Honeynet component. Unsolicited traffic can be directed to each of 
these components which provide unique measurement capabilities. These components, 
in addition to MRTG [20] and FlowScan [23], were run on Linux-based commodity 
PCs. Details of our implementation are illustrated in Figure 1 and include:

1. Passive Monitor - This component is based on Argus, which is a generic libpcap-based
IP network auditing tool. It allows for flow level monitoring of sink traffic and
can be interfaced with FlowScan, which is a flow level network traffic visualization
tool.

2. Active Sink - The standard collection of elements provided with Click enabled
many of the basic capabilities required for building active responses in iSink. Figure 2
illustrates iSink’s configuration based on Click’s modular design. Some of the
fundamental elements include: (i) Poll Device, which constantly polls the interface
for new packets; (ii) IP Classifier, which routes ARP packets to the ARP Responder,
ICMP ping packets to the Ping Responder and TCP packets to the Windows
Responder (all other packets are discarded); (iii) Windows Responder, which responds
to connection attempts on open ports and forwards HTTP requests to the
Web Responder and SMB data packets to the NetBIOS Responder. The application
responders developed specifically for iSink are shaded. As far as we know, ours
is the first non-commercial Honeypot system to provide emulation capabilities for
Windows Networking (NetBIOS/SMB/CIFS) and DCERPC. The current suite of
responders that are available also includes an HTTP responder, an SMTP responder,
an IRC responder, a Dameware responder and a responder for backdoor ports
such as those opened by MyDoom and Beagle.

² Click also provides the flexibility to be run as a user-level module, which greatly simplifies
debugging and development.

³ Honeyd forks a process per connection attempt. A more recent version of Honeyd includes
support for Python threads. However, scalability improvements are limited by the overhead of
the Python interpreter.

Stateless responders are enabled by the following two observations: 

(a) It is almost always possible to concoct a suitable response just by looking at 
the contents of the request packet from the client - even for complex protocols 
like SMB. Knowledge of prior state is not compulsory. 

(b) We need to continue the packet exchange only until the point where we can 
reliably identify the worm/virus. 

3. NAT Filter - The motivation behind filtering is to reduce the volume of traffic generated
by active responders. This module serves two purposes. It routes requests to
appropriate responders (Active Sink or Honeynets) through network address translation.
It also filters requests that attempt to exploit known vulnerabilities or misconfiguration.
This makes mapping of iSinks more difficult and increases the scalability
of analysis daemons that have to process large volumes of data. We experimented
with several filtering strategies:

For each source IP allow only:

(a) first N connections

(b) first N connections per <destination port>

(c) connections to first N destination IPs targeted by the source

Of the three strategies, option (c) [N destination IPs per source IP] seemed the
most attractive. The performance of options (a) and (c) was comparable. They
both provided a reduction of two orders of magnitude (in the volume of packets and
bytes) and were significantly better than option (b). We chose option (c) because it has
the additional advantage of providing a consistent view of the network to the scan
sources, thus allowing the iSink to appear as if it were a subnet with N live hosts⁴.

4. VMware Honeynets - These are, quite simply, commodity operating systems running
on VMware. Currently, we route packets of services for which we don’t have
complete responders to fully patched Windows systems.

5. NIDS - This system can be used to evaluate the packet logs collected at the filter.
We plan to implement support for NIDS rules that can communicate with the filter
and implement real time filtering decisions. For example, the decision to route
packets or migrate a connection to a VMware Honeynet could be triggered upon the
absence of a signature in the NIDS ruleset for the connection.
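The two observations that enable stateless responders (item 2 above) can be illustrated with a minimal Python sketch. It is not the Click-based implementation described in this paper: the `Segment` type, the function names, and the hash-derived initial sequence number are assumptions made purely for illustration. The point is that every reply is computed from the fields of the request packet alone, with no per-connection table.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

SYN, RST, ACK = 0x02, 0x04, 0x10  # TCP flag bits


@dataclass(frozen=True)
class Segment:
    """Hypothetical, simplified view of a TCP segment (not a real packet format)."""
    src: str
    dst: str
    sport: int
    dport: int
    seq: int
    ack: int
    flags: int
    payload: bytes = b""


def initial_seq(src: str, sport: int, dst: str, dport: int) -> int:
    """Assumed scheme: derive our initial sequence number from the 4-tuple,
    so it can be recomputed at any time instead of being stored."""
    key = f"{src}:{sport}>{dst}:{dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def respond(seg: Segment) -> Optional[Segment]:
    """Build a reply using only the incoming segment; returns None if no reply is due."""
    if seg.flags & RST:
        return None
    if seg.flags & SYN and not seg.flags & ACK:
        # Connection request: answer with SYN/ACK.
        return Segment(seg.dst, seg.src, seg.dport, seg.sport,
                       seq=initial_seq(seg.src, seg.sport, seg.dst, seg.dport),
                       ack=(seg.seq + 1) % 2**32, flags=SYN | ACK)
    if seg.payload:
        # Data: the client's ACK field tells us our own next sequence number,
        # so we acknowledge the payload without remembering anything.
        return Segment(seg.dst, seg.src, seg.dport, seg.sport,
                       seq=seg.ack, ack=(seg.seq + len(seg.payload)) % 2**32,
                       flags=ACK)
    return None
```

An application-level responder would attach a protocol-specific payload to the second kind of reply, chosen by inspecting `seg.payload`; that part is omitted here.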

For this study, we built and deployed two separate iSinks: a “campus-enterprise” 
iSink and a “service-provider” iSink. These were used to assess our iSink design and 
demonstrate its capabilities. 
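The three per-source filtering strategies listed for the NAT Filter can be sketched in a few lines of Python. This is an illustrative model only, not the filter used in the deployments: the class name `SourceFilter`, its method `allow`, and the use of in-memory dictionaries are assumptions.

```python
from collections import defaultdict


class SourceFilter:
    """Hypothetical per-source admission filter.

    strategy 'a': first N connections per source
    strategy 'b': first N connections per (source, destination port)
    strategy 'c': connections to the first N destination IPs per source
    """

    def __init__(self, strategy: str, n: int):
        if strategy not in ("a", "b", "c"):
            raise ValueError("strategy must be 'a', 'b' or 'c'")
        self.strategy = strategy
        self.n = n
        self.counts = defaultdict(int)   # used by (a) and (b)
        self.dests = defaultdict(set)    # used by (c)

    def allow(self, src: str, dst: str, dport: int) -> bool:
        if self.strategy == "a":
            self.counts[src] += 1
            return self.counts[src] <= self.n
        if self.strategy == "b":
            self.counts[(src, dport)] += 1
            return self.counts[(src, dport)] <= self.n
        # Strategy (c): a destination admitted once stays reachable for this
        # source, which gives the scanner a consistent view of N "live" hosts.
        admitted = self.dests[src]
        if dst in admitted:
            return True
        if len(admitted) < self.n:
            admitted.add(dst)
            return True
        return False
```

With strategy (c), repeated connections to an already admitted destination keep succeeding, while any destination beyond the first N appears dead to that source.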

⁴ The set of N destination hosts varies with each source depending on the order in which the
source scans the address space.

3.4 Deployment: Campus-Enterprise Sink

The campus iSink received unsolicited traffic destined for approximately 100,000 un- 
used IPv4 addresses within 4 sparsely-to-moderately utilized class-B networks that are 
in use at our campus. Essentially, these unused addresses are in the “holes” between 
active subnets, each of which typically contains 128 to 1024 contiguous host addresses 
(i.e., 25 through 22-bit netmasks, respectively). 
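The relationship between netmask length and subnet size quoted above follows directly from the 32-bit IPv4 address width. A small Python check (the function name is ours, introduced only for illustration):

```python
def addresses_in_prefix(prefix_len: int) -> int:
    """Number of IPv4 addresses covered by a netmask of the given length."""
    if not 0 <= prefix_len <= 32:
        raise ValueError("IPv4 prefix length must be between 0 and 32")
    return 2 ** (32 - prefix_len)


for prefix_len in (25, 22, 16, 8):
    print(f"/{prefix_len}: {addresses_in_prefix(prefix_len):,} addresses")
```

A 25-bit mask covers 128 addresses and a 22-bit mask covers 1024. Four class B (/16) networks hold 4 x 65,536 = 262,144 addresses, so roughly 100,000 unused addresses is a little over a third of that space.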

A so-called “black-hole” intra-campus router was configured to also advertise the
class B aggregate /16 routes into the intra-campus OSPF. The result was that there were
persistent less-specific (16 bit netmask) routes for every campus address. Unsolicited
traffic, whether from campus or outside sources, destined for unused campus IP addresses
always “falls through” to those less-specific /16 routes, and therefore is routed
to the iSink and measured. Furthermore, occasionally traffic destined for campus addresses
that are normally in use can fall through to the iSink if its subnet’s more specific
route disappears. Typically, this only happens during network outages, making the iSink 
a potential warning system of problems because it can passively detect routing failures. 
Whenever traffic that was destined for a campus IP address known to be in use reaches 
the iSink instead, the operators know that there is a problem. 
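The “fall through” behavior is ordinary longest-prefix-match routing. The following Python sketch models it with a toy routing table; the prefixes (taken from private address space) and the next-hop labels are hypothetical and are not the campus addresses.

```python
import ipaddress


def next_hop(routes, destination):
    """Return the next hop of the most specific route covering destination, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes.items() if addr in net]
    if not matches:
        return None
    # Longest prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Hypothetical table: one aggregate /16 pointing at the sink, one active subnet.
routes = {
    ipaddress.ip_network("172.16.0.0/16"): "isink",
    ipaddress.ip_network("172.16.4.0/22"): "subnet-gateway",
}

print(next_hop(routes, "172.16.5.9"))     # covered by the active /22
print(next_hop(routes, "172.16.200.1"))   # unused space, falls through to the /16

# If the more specific route disappears (e.g. during an outage), traffic for
# addresses that are normally in use also falls through to the sink.
del routes[ipaddress.ip_network("172.16.4.0/22")]
print(next_hop(routes, "172.16.5.9"))
```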

It was important in our environment that the iSink machine was not capable of 
actively participating in the intra-campus routing, other than to respond via ARP as 
the IP nexthop on its transit link. The iSink is not an OSPF router, but instead is the 
destination of a static route. This limits the possible damage that could be caused if
the iSink system were ever compromised and an attacker attempted to use it maliciously.



3.5 Deployment: Service-Provider Sink 

The service-provider iSink received unsolicited traffic destined for 16 million IPv4 ad- 
dresses in one class A network. An ISP router, located at our campus’ service-provider, 
served as the gateway for the service-provider iSink. The service-provider was respon- 
sible for advertising the class A network via BGP to our service provider’s commercial 
transit providers, Internet2’s Abilene network, and to various other peers. SNMP-based 
measurements at the Ethernet switch’s ports were used to compute any packet loss by 
the libpcap-based Argus software. 



4 Experiences with Internet Sink 

Investigating Unique Periodic Probes. The periodicity observed in the service 
provider iSink data is an excellent example of the perspective on intrusion traffic af- 
forded by iSink. The first step in our analysis of this periodicity was to understand the 
services that contributed to this phenomenon. We found that most of the periodicity 
observed in the TCP flows could be isolated to sources scanning two services (port 139 
and 445) simultaneously. Port 139 is SMB (Server Message Block protocol) over Net- 
BIOS and port 445 is direct SMB. However, this did not help us isolate the attack vector 
because it is fairly common for NetBIOS scanners to probe for both these services. Passive
logs provided three additional clues: 1) scans typically involve 256 successive IP
addresses that span a /24 boundary, 2) the probes had a period of roughly 2.5 hours,
3) the small timescale periodicity seemed to be superimposed over a diurnal periodic
behavior at larger timescales.

Fig. 3. Inbound Traffic for a Typical Week on Campus-Enterprise Sink (bits/pkts per second)
[Panels: Inbound Bits by IP Protocol; Inbound Packets by IP Protocol; Campus Network Sink (~100K Addresses)]

Figure 6 shows the number of flows scanning both services in a week. To simplify 
our analysis we then focused on a single day’s data and classified scanners on these 
services based on their scan footprints. We defined scanners that match our profile (be- 
tween 250-256 successive IP addresses spanning a /24 boundary) as type-1 sources. We 
also defined sources that scan five or more subnets simultaneously as type-5 sources. 
This includes processes that pick destination IP addresses randomly and others that are 
highly aggressive. Figure 7 shows a time-volume graph of the type-1 and type-5 scanners.
The interesting aspect of this figure is that the number of sources in each peak
(around 100) is more than an order of magnitude smaller than the total number
of participants observed in a day (2,177). We can also see that most of the diurnal
behavior could be attributed to type-5 sources.
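The footprint-based classification can be sketched in Python as follows. The exact matching rules used in the study are not given, so two interpretations are assumed here and should be treated as such: a type-1 source is one whose longest run of consecutive destination addresses has length 250 to 256 and crosses a /24 boundary, and a type-5 source is one whose destinations fall in five or more distinct /24 subnets within the observation window.

```python
import ipaddress


def longest_run(sorted_ints):
    """Return (start, length) of the longest run of consecutive integers."""
    best_start, best_len = sorted_ints[0], 1
    run_start, run_len = sorted_ints[0], 1
    for prev, cur in zip(sorted_ints, sorted_ints[1:]):
        if cur == prev + 1:
            run_len += 1
        else:
            run_start, run_len = cur, 1
        if run_len > best_len:
            best_start, best_len = run_start, run_len
    return best_start, best_len


def classify(dst_ips):
    """Classify one source by the destination IPs it probed (assumed rules, see text)."""
    ints = sorted({int(ipaddress.ip_address(ip)) for ip in dst_ips})
    if not ints:
        return "other"
    if len({i >> 8 for i in ints}) >= 5:          # five or more /24 subnets
        return "type-5"
    start, length = longest_run(ints)
    end = start + length - 1
    crosses_24 = (start >> 8) != (end >> 8)       # run is not confined to one /24
    if 250 <= length <= 256 and crosses_24:
        return "type-1"
    return "other"
```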

This mystery motivated our development of NetBIOS and SMB responders. By 
observing the packet logs generated by the active response system we concluded that the 
scanning process was the LovGate worm [2], which creates the file NetServices.exe
among others. 

This section demonstrates iSink’s capabilities and illustrates the complementary 
roles of the Passive Monitor and the Active Sink using results from our two iSink de- 
ployments. We first discuss issues of perspective by comparing the passive-monitoring 
results observed in the campus-enterprise sink with that of the service-provider sink. We 
then demonstrate the utility of the Active Sink in investigating network phenomena
revealed by the Passive Monitor, including periodic probing and SMTP hot-spots.

4.1 Campus Enterprise iSink Case Study 

Because the campus iSink is located inside one autonomous system and advertised via 
the local interior routing protocol, this system sees traffic from local sources in addition 
to traffic from sources in remote networks. Traffic observed from local sources included: 

- Enterprise network management traffic attempting to discover network topology
and address utilization (such as ping sweeps and SNMP query attempts)

- Traffic from misconfigured hosts. For instance, a few hosts continually send domain
queries to what is now an unused campus IP address. Presumably, an operational
DNS server used to be at that address. We also see traffic from misconfigured AFS
clients and NetBIOS name registration requests from local Windows hosts with
an incorrect WINS address.

- Malicious probes and worm traffic that has an affinity for hosts within their classful
network.

Fig. 4. Inbound Traffic for a Typical Week on Service Provider Sink (bits/pkts per second)
[Panels: Inbound Bits by IP Protocol; Inbound Packets by IP Protocol; Class-A Network Sink (16M Addresses)]

Figure 3 shows the traffic observed from only remote sources in a typical week at 
the campus-enterprise iSink. There are several notable features. The dominant protocol 
is TCP since the campus border routers filter scans to port 1434 (ms-sql-m) that was 
exploited by the SQL-Slammer worm [16]. The peak rate of traffic is about 1 Mb/s and
1500 packets per second. There is no obvious periodicity in this dataset. Finally, because 
TCP is the dominant protocol, the packet sizes are relatively constant and the number 
of bytes and packets follow a predictable ratio. Hence, the graphs of bit and packet rate 
show very similar trends. 

4.2 Service Provider iSink Case Study 

The volume of unsolicited inbound traffic to the class A network varied between average 
rates of 5,000 packets-per-second (pps) when we brought the system on line to over 
20,000 pps six months later at the end of our study. One consequence that was relayed
to us by experienced network operators is that it is not possible to effectively operate 
even this relatively quiescent class A network at the end of a 1.5 megabit-per-second 
T1 link because the link becomes completely saturated by this unsolicited traffic. 
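The saturation claim can be checked with simple arithmetic. The sketch below computes the average packet size at which a given packet rate fills a T1; the nominal T1 line rate of 1.544 Mb/s is used as an assumption (the text rounds it to 1.5 Mb/s), and the function name is ours.

```python
T1_BITS_PER_SECOND = 1_544_000  # nominal T1 line rate (assumption; text says 1.5 Mb/s)


def break_even_packet_bytes(packets_per_second: float) -> float:
    """Average packet size (bytes) at which the given rate exactly fills a T1."""
    return T1_BITS_PER_SECOND / (8 * packets_per_second)


for pps in (5_000, 20_000):
    size = break_even_packet_bytes(pps)
    print(f"{pps} pps saturates a T1 once packets average {size:.2f} bytes")
```

At 5,000 pps the break-even size is below 40 bytes, which is the size of a TCP SYN with no options, and UDP probes carry their payload on top of 28 bytes of IP and UDP headers. At 20,000 pps the break-even size is smaller than a bare 20-byte IPv4 header, so the link cannot carry the load regardless of packet mix.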

To operate the service-provider iSink continuously, we originally assumed that we
could safely introduce the class A least-specific /16 route for the iSink and still allow
operators to occasionally introduce more-specific routes to draw the network’s traffic
elsewhere in the Internet when need be. While sound in theory (according to “CIDR
and Classful Routing” [24]), it didn’t work in practice. Because today’s Internet is bifurcated
into commercial/commodity networks and research/education networks (Internet2’s
Abilene), some institutions connected to both types employ creative routing
policies. We found that some sites prefer less-specific routes over more-specific when
the less-specific route is seen on what is likely to be a higher-performance (or fixed
cost) service such as Internet2.

Table 2. Top Services (Service Provider Sink)

Service               Inbound flows per second
udp_netbios-ns_dst    1932
udp_ms-sql-m_dst      1187
http_dst              197
netbios-ssn_dst       133
microsoft-ds_dst      115
smtp_dst              67
http_src              44
https_dst             11
ms-sql-s_dst          10
telnet_dst            2

Table 3. Backscatter sources (victims) in service provider sink (12 hrs - 5 min avg)

Type                          Num IPs   % IPs
TCP_RST                       295       38%
TCP_SYN_RST                   105       14%
TCP_ACK                       81        10%
TCP_ACK_RST                   80        10%
ICMP_INTRANS_TIME_EXCEEDED    58        7%
ICMP_PORT_UNREACH             29        4%
ICMP_PKT_FILTERED_UNREACH     23        3%
TCP_SYN_ACK                   10        1%
ICMP_HOST_UNREACH             6         1%
OTHER                         87        11%

Fig. 5. Time-volume graph of backscatter packet types on service-provider sink over a
typical 12 hour period
[Panel: Inbound Backscatter Packets; Class-A Network Sink (16M Addresses), 12 hours]

Fig. 6. Inbound flows (per second) observed at service-provider sink on ports 139 and 445
over a typical week
[Panel: Periodic Service Probing, period = ~2.67 hours; Class-A Network Sink (16M Addresses)]

Figure 4 depicts the traffic observed in a typical week at the service-provider iSink.
Unlike the campus-enterprise network, the dominant protocol is UDP, most of which
can be attributed to Windows NetBIOS scans on port 137 and the ms-sql-m traffic from
worms attempting to exploit the vulnerable MS-SQL monitor. Since UDP traffic with
payloads of varying sizes dominates, there is no strong correspondence between the
graphs for bytes and packets. The most interesting feature is the striking periodic behavior
of the TCP flows, discussed in more detail in Section 4. Table 2 provides a
summary of the inbound per second flow rate of the top services.

Analysis of Backscatter Packets. Backscatter packets are responses to spoofed DoS 
attacks and have been effectively used to project Internet-wide attack behavior [19].
Figure 5 provides a time series graph of the backscatter packet volume observed in our 
service-provider sink. Noteworthy features include the following: 

1. TCP packets with ACK/RST dominate, as might be expected. This would be the
most common response to a SYN flood from forged sources.

2. Vertical lines that correspond to less common short duration spikes of SYN/ACK
and SYN/ACK/RST.

3. ICMP TTL exceeded packets could be attributed to either routing loops or DoS
floods with a low initial TTL.

Fig. 7. Left: Volume/Count of type-1 port 139 scanners: 24 hours, Dec 14, 2003 (no. sources per
peak = 100, total sources = 2,177). Right: Volume of type-5 port 139 scanners

Table 3 provides a summary of the number of active sources of backscatter traffic, 
i.e., the estimated count of the victims of spoofed source attacks. These numbers are 
an average during the 12 hours shown in Figure 5 of the number of sources in each 5 
minute sample. In terms of the distribution of the volumes of Backscatter scan types, 
our results are consistent with those published in [19]. Backscatter made up a small 
percentage (under 5%) of the overall traffic seen on our service-provider sink. 
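The averaging described for Table 3 (distinct backscatter sources per 5 minute sample, averaged over the observation period) can be sketched as below. The record layout `(timestamp, source_ip, label)` and both function names are assumptions for illustration; the labels mimic those in Table 3 but the flag-to-label mapping is our own guess at the naming scheme.

```python
from collections import defaultdict

# Flag bits in the order they appear in labels such as TCP_SYN_ACK or TCP_ACK_RST.
TCP_FLAG_NAMES = ((0x02, "SYN"), (0x10, "ACK"), (0x04, "RST"))


def tcp_label(flags: int) -> str:
    """Map TCP flag bits to a label like 'TCP_ACK_RST' (assumed naming scheme)."""
    names = [name for bit, name in TCP_FLAG_NAMES if flags & bit]
    return "TCP_" + "_".join(names) if names else "OTHER"


def mean_sources_per_bin(records, bin_seconds=300):
    """records: iterable of (timestamp_seconds, source_ip, label).

    Returns label -> mean number of distinct sources per bin. Only bins that
    contain at least one record are counted in the average.
    """
    per_bin = defaultdict(lambda: defaultdict(set))
    for ts, src, label in records:
        per_bin[int(ts // bin_seconds)][label].add(src)
    if not per_bin:
        return {}
    totals = defaultdict(int)
    for labels in per_bin.values():
        for label, sources in labels.items():
            totals[label] += len(sources)
    return {label: total / len(per_bin) for label, total in totals.items()}
```

Dividing each label's mean by the sum over all labels gives percentages of the kind reported in the "% IPs" column.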

We proceeded to set up a controlled experiment, which began by trying to infect
a Windows 2000 host running on VMware with LovGate. LovGate uses a dictionary
attack, so we expected a machine with a blank administrative password to be easily
infected. However, the NetBIOS sessions were continually getting rejected due to NetBIOS
name mismatches. So we modified the lmhosts file to accept the name *SMBSERVER,
enabling us to capture the worm.

We verified that LovGate’s NetBIOS scanning process matched the profile of the
type-1 scanners⁵. To date, we have not been able to disassemble the binary as it is a
compressed self-extracting executable. So we monitored the scans from the infected
host. There were two relevant characteristics that provide insight into the periodicity:
1) The scanning process is deterministic, i.e., after every reboot it repeats the same
scanning order; 2) During the course of a day there are several 5-10 minute intervals
where it stops scanning. Our conjecture is that these gaps occur due to approximately
synchronized clocks in the wide area, thus producing the observed periodicity.
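The periodicity discussed here was identified from time-volume graphs. As a generic illustration only (this is not a method described in the paper), one common way to estimate such a period from binned flow counts is to pick the lag that maximizes the autocorrelation of the series. The function name and parameters below are hypothetical.

```python
def dominant_period(series, min_lag=1):
    """Return the lag (in bins) with the highest unnormalized autocorrelation.

    Lags from min_lag up to half the series length are considered.
    """
    n = len(series)
    if n // 2 <= min_lag:
        raise ValueError("series too short for the requested minimum lag")
    mean = sum(series) / n
    centered = [x - mean for x in series]

    def autocorr(lag):
        return sum(centered[i] * centered[i + lag] for i in range(n - lag))

    return max(range(min_lag, n // 2), key=autocorr)


# Synthetic example: a spike every 10 bins.
counts = [1 if i % 10 == 0 else 0 for i in range(100)]
print(dominant_period(counts))
```

With 5 minute bins, a period of roughly 2.5 hours would correspond to a lag of about 30 bins.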

SMTP Hot-Spot. Analysis of SMTP (Simple Mail Transfer Protocol) scans in the
service provider sink is another important demonstration of Active Sink’s capabilities.
From passive measurements, we identified an SMTP hot-spot, i.e., there was one IP
address that was attracting a disproportionately large number of SMTP scans (20-50
scans per second). Hot-spots in unused address space are typically good indicators of
misconfiguration. During a 10 day period in December we observed over 4.5 million
scans from around 14,000 unique IP addresses all bound to one destination IP within
our monitor. A cursory analysis suggested that these scans were all from cable-modem
and DSL subscribers. Finally, the scans also seemed to have an uncommon TCP SYN
fingerprint (win 8192, mss 1456).

⁵ Besides the NetBIOS scanning, LovGate also sent SMTP probes to www.163.com.

The possibility of spam software as a source of this anomaly was ruled out due
to the non-standard TCP fingerprint. We then hypothesized that this could be from a
specific cable-modem or DSL device. We set up an SMTP responder on the target IP
address and captured the incoming email. This revealed the source of the email to be
misconfigured wireless-router/firewall systems from a major vendor⁶. The emails
are actual firewall logs!
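A TCP SYN fingerprint of the kind mentioned above (window size and maximum segment size) can be read directly from the raw TCP header. The sketch below follows the standard TCP header layout; the function name is ours, and the tooling actually used in the study is not specified.

```python
import struct


def syn_fingerprint(tcp_header: bytes):
    """Return (window, mss) from a raw TCP header; mss is None if the option is absent."""
    if len(tcp_header) < 20:
        raise ValueError("TCP header must be at least 20 bytes")
    window = struct.unpack_from("!H", tcp_header, 14)[0]   # bytes 14-15
    header_len = (tcp_header[12] >> 4) * 4                 # data offset in bytes
    options = tcp_header[20:header_len]
    mss = None
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:            # end of option list
            break
        if kind == 1:            # no-operation padding
            i += 1
            continue
        if i + 1 >= len(options):
            break                # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                # malformed option
        if kind == 2 and length == 4:
            mss = struct.unpack_from("!H", options, i + 2)[0]
        i += length
    return window, mss


# Hand-built SYN header with window 8192 and an MSS option of 1456.
header = struct.pack("!HHIIBBHHH", 1025, 25, 0, 0, 0x60, 0x02, 8192, 0, 0)
header += b"\x02\x04" + struct.pack("!H", 1456)
print(syn_fingerprint(header))
```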

To better understand the reasons behind this SMTP hot-spot, we examined the firewall
system’s firmware. The unarj utility was used to extract the compressed binary.
However, searching for the hot-spot IP address string in the binary proved fruitless.
Examination of the firmware “application” revealed that there was an entry for the SMTP
server that was left blank by default. This led us to conjecture that the target IP address
was the result of an uninitialized garbage value that was converted to a network ordered
IP address. It also turns out that every byte in our hot-spot address is a printable ASCII
character. So we searched for this four byte ASCII string and found a match in almost
all versions of firmware for this device. The string occurred in both the extracted and
compressed versions of the firmware. As a sanity check, we looked for other similar
ASCII strings, but did not find them. These kinds of hot-spots can have very serious
ramifications in network operations. For example, one of the authors discovered a similar
problem with Netgear routers that inadvertently flood our campus NTP servers [22].
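The link between a four byte ASCII string and an IP address can be made concrete with a short Python example. The string `"abcd"` is a placeholder chosen for illustration; the actual hot-spot address is not disclosed in the text, and the helper names are ours.

```python
import socket


def ascii_to_ipv4(four_chars: str) -> str:
    """Interpret four ASCII characters as a network-ordered IPv4 address."""
    raw = four_chars.encode("ascii")
    if len(raw) != 4:
        raise ValueError("exactly four ASCII characters are required")
    return socket.inet_ntoa(raw)


def is_printable_ascii_address(ip: str) -> bool:
    """True if every byte of the IPv4 address is a printable ASCII character."""
    return all(0x20 <= b <= 0x7E for b in socket.inet_aton(ip))


print(ascii_to_ipv4("abcd"))                       # placeholder string, not the real one
print(is_printable_ascii_address("97.98.99.100"))
print(is_printable_ascii_address("10.0.0.1"))
```

Running the conversion in reverse (address to bytes) yields the string one would search for in a firmware image.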

⁶ We are in the process of notifying the manufacturer and plan to reveal the name of the vendor
once this is completed.

Experiences with Recent Worms. Our iSink deployment has proved quite useful in
detecting the advent of recent worms such as Sasser [5]. Without active response capability,
such as that provided by the Active Sink, it would be impossible to distinguish
existing worm traffic on the commonly exploited ports such as port 445 from new worm
activity. Detection of such new worms is often possible without modifications to the
responder, as was the case for the lsarpc exploit used by Sasser. Our active response
system enabled accurate detection of not only Sasser, but also more fine-grained
classification of several variants. Prior to the release of Sasser, we were also able to observe
early exploits on the lsarpc service which could be attributed to certain strains of
Agobot. Figures 8 and 9 illustrate the interaction of RBOT.CC [30], a more recent
virus that also exploits the lsarpc vulnerability, with the Active Sink.

5 Basic Performance 

One of the primary objectives of the iSink’s design is scalability. We performed scala- 
bility tests on our Active Sink implementation using both TCP and UDP packet streams. 
The experimental setup involved four 2GHz Pentium 4 PCs connected in a common lo- 
cal area network. Three of the PCs were designated as load generators and the fourth 
was the iSink system that promiscuously responded to all ARP requests destined to 
any address within one class A network. Figure 10 demonstrates the scalability 
of LaBrea^ and Active Sink under TCP and UDP stress tests. The primary difference 
between the TCP and UDP tests is that the TCP connection requests cause the iSink 
machine to respond with acknowledgments, while the UDP packets do not elicit a re- 
sponse. Ideally, we would expect the number of outbound packets to equal the number 
of inbound packets. The Click-based Active Sink scales well to TCP load with 
virtually no loss up to about 20,000 packets (connection attempts) per second. LaBrea 
performance starts to degrade at about 2,000 packets per second. The UDP test used 300 byte UDP 
packets (much like the SQL-Slammer worm). In this case, both LaBrea and Active 
Sink perform well. LaBrea starts to experience a 2% loss rate at about 15,000 
packets/sec. 

6 Sampling 

There are three reasons why connection sampling can greatly benefit an iSink 
architecture: (i) reduced bandwidth requirements, (ii) improved scalability, (iii) simplified data 
management and analysis. In our iSink architecture, we envision building packet-level 
sampling strategies in the Passive Monitor and source-level sampling in the NAT Filter. 

We considered two different resource constraint problems in the passive portion of 
the iSink and evaluated the use of sampling as a means for addressing these constraints. 
We first considered the problem of a fixed resource in the iSink itself. Estan and Vargh- 
ese in [6] describe sampling methods aimed at monitoring “heavy hitters” in IP flows 
through routers with a limited amount of memory. We adapted one of these methods for 
use in iSink. Second, we considered the problem of bandwidth as the limited resource. 

^ We compare Active Sink with LaBrea because unlike LaBrea, Honeyd is stateful (forks a 
process per connection), and hence is much less scalable. Since Honeyd also relies on a packet 
filter, LaBrea’s scalability bounds affect Honeyd as well. 
Fig. 10. Scalability of Click-based Internet Sink and LaBrea for TCP (left) and UDP (right) flows 



In this case, the idea is to reduce the total amount of traffic routed to an iSink by se- 
lecting subnets within the total address space available for monitoring. These methods 
would be used in combination with the filtering methods described in Section 3.3. 

Memory Constrained iSink Sampling. The method that forms the basis of our sam- 
pling approach with a memory constrained iSink is called Sample and Hold [6]. This 
method accurately identifies flows larger than a specified threshold (i.e., heavy hitters). 
Sample and hold is based on simple random sampling in conjunction with a hash table 
that is used to maintain flow ID’s and byte counts. Specifically, incoming packets are 
randomly sampled and entries in the hash table are created for each new flow. After 
an entry has been created, all subsequent packets belonging to that flow are counted. 
While this approach can result in both false positives and false negatives, its accuracy 
is shown to be high in workloads with varied characteristics. We apply sample and hold 
in iSink to the problem of identifying “heavy hitters”, which are the worst offending 
source addresses based on the observed number of scans. 
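The sample-and-hold idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the iSink implementation: the input format (one source identifier per packet), the function names, the fixed table capacity without eviction, and the synthetic trace are all hypothetical.

```python
import random
from collections import Counter


def sample_and_hold(packets, sampling_prob, max_entries, seed=0):
    """Return {source: packet count} built with sample and hold.

    packets       -- iterable of source identifiers, one per packet (assumed format)
    sampling_prob -- chance that a packet from an untracked source creates an entry
    max_entries   -- table capacity; untracked sources are ignored once it is full
    """
    rng = random.Random(seed)
    table = {}
    for src in packets:
        if src in table:
            table[src] += 1  # hold: every later packet of a tracked source is counted
        elif len(table) < max_entries and rng.random() < sampling_prob:
            table[src] = 1   # sample: start tracking this source
    return table


def top_k(table, k):
    """Sources with the k largest counts."""
    return [src for src, _ in Counter(table).most_common(k)]


# Synthetic trace (hypothetical data): one heavy scanner plus 5,000 one-packet sources.
trace = ["heavy"] * 5000 + [f"light-{i}" for i in range(5000)]
random.Random(1).shuffle(trace)
table = sample_and_hold(trace, sampling_prob=0.01, max_entries=200)
worst = top_k(table, 1)
```

Because a tracked source is counted on every subsequent packet, the count of a heavy hitter is only short by the packets seen before its first sampled packet, which is why a small table and a coarse sampling rate can still rank the worst offenders correctly.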

Adapting the sample and hold method to the iSink required us to define the size 
of the hash table that maintains the data, and the sampling rate based on empirical 
observation of traffic at the iSink. In [6], the objective is identifying accurately the 
flows that take over T% of a link’s capacity. An oversampling factor O is then selected 
to reduce the possibility of false negatives in the results. These parameters result in 
allocating $HT_{len} = (1/T) \cdot O$ locations in each hash table. The packet sampling rate 
is then set to $HT_{len}/C$ where $C$ is the maximum packet transmission capacity of the 
incoming link over a specified measurement period t. At the end of each t, the hash 
table is sorted and results are produced. 

Bandwidth Constrained iSink Sampling. In the bandwidth constrained scenario, the 
sampling design problem is to select a set of subnets from the total address space that is 
available for monitoring on the iSink. The selection of the number of subnets to monitor 
is based on the bandwidth constraints. In this case we assume that we know the mean 
and variance for traffic volume on a “typical” class B or class C address space. We then 
divide the available bandwidth by this value to get the number of these subnets that can 
be monitored. The next step is to select the specific subnets within the entire space that 
will minimize the error introduced in estimates of probe populations. 
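As a rough sketch of the sizing step in the bandwidth constrained case, the number of subnets can be obtained by dividing the budget by the mean per-subnet volume and then drawing that many subnets at random. The function names and the 10,000 packets per minute budget are hypothetical; the 320 packets per minute per /16 is the average reported for our class A monitor, and only the mean (not the variance) is used here.

```python
import random


def subnets_within_budget(bandwidth_budget, mean_subnet_volume):
    """Number of subnets whose expected traffic fits in the budget (same units)."""
    return int(bandwidth_budget // mean_subnet_volume)


def select_subnets(all_subnets, count, seed=0):
    """Simple random sample of subnets, without replacement."""
    rng = random.Random(seed)
    candidates = list(all_subnets)
    return rng.sample(candidates, min(count, len(candidates)))


# A /8 contains 256 /16 subnets; the 10.x prefix is used purely for illustration.
all_16s = [f"10.{i}.0.0/16" for i in range(256)]
n_subnets = subnets_within_budget(10_000, 320)
chosen = select_subnets(all_16s, n_subnets)
```

A more conservative variant would divide by the mean plus some multiple of the standard deviation, so that bursts on the selected subnets do not exceed the budget.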
Fig. 11. Error rates for different hash table sizes (x-axis is log scale) using Sample and Hold 
method with rates of 1/100 (left) and 1/300 (right) 



Our analysis in this paper is based on the use of random sampling as a means for 
subnet selection. Our rationale for this approach is based on the observation that 
overall traffic volumes across the service-provider class A address space that we monitor are 
quite uniform. The strengths of this approach are that it provides a simple method for 
subnet selection, it provides unbiased estimates and it lends itself directly to analysis. 
The drawback is that sampling designs that take advantage of additional information 
such as clustered or adaptive sampling could provide more accurate population esti- 
mates. We leave exploration of these and other sampling methods to future work. 

After selecting the sampling design, our analysis focused on the problem of 
detectability. Specifically, we were interested in understanding the accuracy of estimates 
of total probe populations from randomly selected subsets. If $\hat{\tau}$ is an 
unbiased estimator of a population total $\tau$, then the estimated variance of $\hat{\tau}$ is given by: 

$$\mathrm{var}(\hat{\tau}) = N^2 \left[ \left(\frac{N-n}{N}\right) \frac{\sigma^2}{n} + \left(\frac{1-p}{p}\right) \frac{\mu}{n} \right]$$

where $N$ is the total number of units (in our case, subnets), $n$ is the sampled number 
of units, $\mu$ is the population mean (in our case, the mean number of occurrences of a 
specific type of probe), $\sigma^2$ is the population variance and $p$ is the probability of 
detection for a particular type of probe. In the analysis presented in Section 6.1, we evaluate 
the error in population estimates over a range of detection probabilities for different 
size samples. The samples consider dividing the class A address space into its 
component class B’s. The probabilities relate directly to detection of worst offenders (top 
sources of unsolicited traffic) as in the prior sampling analysis. The results provide a 
means for judging population estimation error rates as a function of network bandwidth 
consumption. 
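To make the roles of sample size and detection probability concrete, the following sketch evaluates the variance under the assumption that it takes the standard form for simple random sampling with detectability, $N^2[((N-n)/N)\,\sigma^2/n + ((1-p)/p)\,\mu/n]$. The function names, the normalization by the true total $N\mu$, and all numeric values are hypothetical choices for illustration.

```python
import math


def estimator_variance(N, n, mu, sigma2, p):
    """Assumed form: N^2 * [((N - n)/N) * sigma2/n + ((1 - p)/p) * mu/n]."""
    sampling_term = ((N - n) / N) * sigma2 / n   # error from monitoring only n of N units
    detection_term = ((1.0 - p) / p) * mu / n    # error from imperfect detection
    return N ** 2 * (sampling_term + detection_term)


def relative_error(N, n, mu, sigma2, p):
    """Standard deviation of the estimate divided by the true total N * mu."""
    return math.sqrt(estimator_variance(N, n, mu, sigma2, p)) / (N * mu)


# Hypothetical numbers: 256 /16 subnets, mean 100 probes per subnet, variance 400.
errors = {(n, p): relative_error(256, n, 100.0, 400.0, p)
          for n in (16, 64, 256) for p in (1.0, 0.5)}
```

The first term vanishes when every subnet is monitored ($n = N$) and the second vanishes when detection is perfect ($p = 1$), so the error shrinks both with more subnets and with better detection.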



6.1 Sampling Evaluation 

Our evaluation of the impact of sampling in an iSink was an offline analysis using traces 
gathered during one day selected at random from the service-provider iSink. Our objec- 
tive was to empirically assess the accuracy of sampling under both memory constrained 
and bandwidth constrained conditions. In the memory constrained evaluation, we com- 
pare the ability to accurately generate the top 100 heavy hitter source list over four 
Fig. 12. Estimated error ($\sigma/\mu$) for a given number of randomly selected /16’s. Included are error 
estimates for total probes from worst offending source IP over a range of detection probabilities 
(left), estimates for total probes for worst offender lists of several sizes (right) 



consecutive 1 hour periods using different hash table sizes and different sampling rates. 
For each hour in the data set, we compare the percentage difference in the number of 
scans generated by the “true” top 100 blacklist and sampled top 100 blacklist sources. 
In the bandwidth constrained evaluation, we consider accuracy along three dimensions: 
1) estimating the worst offender population with partial visibility, 2) estimating black 
lists of different lengths, 3) estimating backscatter population. 

Our memory constrained evaluation considers hash table sizes varying from 500 to 
64K entries where each entry consists of a source IP and an access attempt count. Note 
that the hash table required to maintain the complete list from this data was on the order 
of 350K entries. We consider two different arbitrarily chosen sampling rates - 1 in 100 
and 1 in 300 with uniform probability. In each case, once a source IP address has been 
entered into the table, all subsequent packets from that IP are counted. If tables become 
full during a given hour then entries with the lowest counts are evicted to make room 
for new entries. At the end of each hour, the top 100 from the true and sampled lists are 
compared. New lists are started for each hour. The results are shown in Figure 11. These 
results indicate that even coarse sampling rates (1/300) and relatively small hash tables 
enable fairly accurate black lists (between 5%-10% error). The factor of improvement 
between sampling at 1/100 and 1/300 is about 1.5, and there is little benefit to increasing 
the hash table size from 5,000 to 20,000. Thus, from the perspective of heavy hitter 
analysis in a memory constrained system, sampling can be effectively employed in 
iSinks. 
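One plausible reading of the comparison used in this evaluation is sketched below: the error is the percentage of scans attributable to the true top-k sources that is missed when the sampled top-k list is used instead. The function names and the toy counts are hypothetical, and the exact error definition behind Figure 11 may differ.

```python
def top_k_sources(counts, k):
    """Sources with the k largest counts; ties are broken by name for determinism."""
    return sorted(counts, key=lambda s: (-counts[s], s))[:k]


def blacklist_error(true_counts, sampled_counts, k):
    """Percentage of true top-k scan volume not covered by the sampled top-k list."""
    true_top = top_k_sources(true_counts, k)
    sampled_top = top_k_sources(sampled_counts, k)
    true_total = sum(true_counts[s] for s in true_top)
    covered = sum(true_counts.get(s, 0) for s in sampled_top)
    return 100.0 * (true_total - covered) / true_total


# Toy example: sampling missed source "b", so "c" takes its place in the top 2.
true_counts = {"a": 100, "b": 80, "c": 60, "d": 10}
sampled_counts = {"a": 90, "c": 55, "d": 9}
error_pct = blacklist_error(true_counts, sampled_counts, k=2)
```

With this definition, swapping two sources of nearly equal volume costs very little, which matches the intuition that a sampled blacklist can be useful even when its membership is not identical to the true one.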

As discussed in the prior section, in our bandwidth constrained evaluation we 
consider the error introduced in population estimates when using simple random sampling over 
a portion of the available IP address space. We argue that simple random sampling is 
appropriate for some analysis given the uniform distribution of traffic over our class A 
monitor. The cumulative distribution of traffic over a one hour period for half of the 
/16 subnets in our class A monitor is shown in Figure 13(right). This figure shows that 
while traffic across all subnets is relatively uniform (at a rate of about 320 packets per 
minute per /16), specific traffic subpopulations - TCP backscatter as an example - can 
show significant non-uniformity which can have a significant impact on sampling. 

We use the mean normalized standard deviation ($\sigma/\mu$) as an estimate of error in our 
analysis. In each case, using the data collected in a typical hour on the /8, we empirically 










Fig. 13. Estimated error for TCP backscatter traffic (left). Cumulative distribution of all traffic 
and TCP backscatter traffic across half of the class A address space monitor over a one hour 
period. The average number of probes per /16 is 320 packets per minute (right) 



assess the estimated error as a function of a randomly selected sample of /16 subnets. 
The results of this approach are shown in Figure 12. The graph on the left shows the 
ability to accurately estimate the number of probes from the single worst offending 
IP source over a range of detection probabilities (i.e., the probability of detecting a 
source in a selected /16). This graph indicates that worst offenders are detectable even 
with a small sample size and error-prone or incomplete measurements. The graph on 
the right shows the ability to accurately estimate black lists from a selected sample of 
/16’s. This graph indicates that it is easier to estimate larger rather than smaller black 
lists when sampling. We attribute this to the variability in black list ordering across the 
/16’s. Finally, Figure 13(left) shows the ability to accurately estimate TCP backscatter 
traffic over a range of detection probabilities. The graph suggests that while backscatter 
estimates are robust in the face of error-prone or incomplete measurements, estimated 
error of total backscatter is quite high even with a reasonably large number of /16’s. 
This can be attributed to the non-uniformity of backscatter traffic across the class A 
monitor shown in Figure 13(right) and suggests that alternative sampling methods for 
backscatter traffic should be explored. On a broader scale, this indicates that traditional 
backscatter methodologies that assume uniformity could be error prone. 



7 Summary and Future Work 

In this paper we describe the architecture and implementation of an Internet Sink: a 
useful tool in a general network security architecture. iSinks have several general de- 
sign objectives including scalability, the ability to passively monitor network traffic on 
unused IP addresses, and to actively respond to incoming connection requests. These 
features enable large scale monitoring of scanning activity as well as attack payload 
monitoring. The implementation of our iSink is based on a novel application of the 
Click modular router, NAT Filter and the Argus flow monitor. This platform provides an 
extensible, scalable foundation for our system and enables its deployment on commod- 
ity hardware. Our initial implementation includes basic monitoring and active response 
capability which we test in both laboratory and live environments. 
We report results from our iSink’s deployment in a live environment comprising 
four class B networks and one entire class A network. The objectives of these case 
studies were to evaluate iSink’s design choices, to demonstrate the breadth of informa- 
tion available from an iSink, and to assess the differences of perspective based on iSink 
location in IP address space. We show that the amount of traffic delivered to these iSinks 
can be large and quite variable. We see clear evidence of the well documented worm 
traffic as well as other easily explained traffic, the aggregate of which can be considered 
Internet background noise. While we expected overall volumes of traffic in the class B 
monitors and class A monitor to differ, we also found that the overall characteristics 
of scans in these networks were quite different. We also demonstrate the capability of 
iSinks to provide insights on interesting network phenomena like periodic probing and 
SMTP hot-spots, and their ability to gather information on sources of abuse through 
sampling techniques. A discussion of operational issues, security, and passive fingerprinting 
techniques is provided in [32]. 

The evaluation of our iSink implementation demonstrates both its performance ca- 
pabilities and expectations for live deployment. From laboratory tests, we show that 
iSinks based on commodity PC hardware have the ability to monitor and respond to 
over 20,000 connection requests per second, which is approximately the peak traffic 
volume we observed on our class A monitor. This also exceeds the current version of 
LaBrea’s performance by over 100%. Furthermore, we show that sampling techniques 
can be used effectively in an iSink to reduce system overhead while still providing ac- 
curate data on scanning activity. 

We intend to pursue future work in a number of directions. First, we plan to expand 
the amount of IP address space we monitor by deploying iSinks in other networks. 
Next, we intend to supplement iSink by developing tools for datamining and automatic 
signature generation. 
Abstract. Intrusion detection systems typically create large amounts 
of alerts, processing of which is a time consuming task for the user. This 
paper describes an application of exponentially weighted moving average 
(EWMA) control charts used to help the operator in alert processing. 
Depending on his objectives, some alerts are individually insignificant, 
but when aggregated they can provide important information on the 
monitored system’s state. Thus it is not always the best solution to 
discard those alerts, for instance, by means of filtering, correlation, or 
by simply removing the signature. We deploy a widely used EWMA 
control chart for extracting trends and highlighting anomalies from alert 
information provided by sensors performing pattern matching. The aim 
is to make output of verbose signatures more tolerable for the operator 
and yet allow him to obtain the useful information available. The applied 
method is described, and experimentation along with its results on real-world 
data is presented. A test metric is proposed to evaluate the results. 

Keywords: IDS background noise, alert volume reduction, EWMA 

1 Introduction 

Perfectly secure systems have been shown to be extremely difficult to design and 
implement, thus practically all systems are vulnerable to various exploits or at 
least to legitimate users’ privilege abuse. Systems used to discover these attacks 
are called Intrusion Detection Systems (IDSes). 

The work in intrusion detection began from the need to automate audit trail 
processing [1] and nowadays IDSes themselves can generate enormous amounts 
of alerts. Just one sensor can create thousands of alerts each day, and a large 
majority of these can be irrelevant [2], partly because the diagnostic capabilities 
of current, fielded intrusion detection systems are rather weak [3] [4]. This alert 
flood can easily overwhelm the human operating the system and the interesting 
alerts become buried under the noise. 

1.1 Alert Overflow and Correlation 

The need to automate alert processing and to reduce the amount of alerts dis- 
played to the operator by the system is a widely recognized issue and the research 
community has proposed as one solution to correlate related alerts to facilitate 
the diagnostics by the operator [5]. 
Alert correlation has three principal objectives with regard to information 
displayed to the operator: 

Volume reduction: Group or suppress alerts, according to common proper- 
ties. E.g. several individual alerts from a scan should be grouped as one 
meta alert. 

Content improvement: Add to the information carried by individual alert. 
E.g. the use of topology and vulnerability information of monitored system 
to verify or evaluate the severity of the attack. 

Activity tracking: Follow multi-alert intrusions evolving as time passes. E.g. 
if attacker first scans a host, then gains remote-to-local access, and finally 
obtains root access, individual alerts from these steps should be grouped 
together. 

We perform volume reduction eliminating redundant information by aggre- 
gating alerts that are not strictly symptoms of compromise and appear in high 
volumes. Only changes in the behavior of the aggregate flow are reported to the 
user. Correlation techniques capable of detecting unknown, novel relationships 
in data are said to be implicit and techniques involving some sort of definition of 
searched relationships are called explicit. As the aggregation criteria are manually 
selected, this is an explicit correlation method. Overall, we aim to save operator 
resources by freeing the majority of time units that manually processing the 
background noise would require and thus to enable him to focus on more rele- 
vant alerts. Even though this manual processing is likely to be periodic skimming 
through the accumulated noise, if there are several sources with omnipresent 
activity, the total time used can be significant. Next we discuss why, despite the 
large amounts of alerts, background noise monitoring can be useful. 

1.2 The Need for Other Type of Processing 

Also according to our experience (see Sect. 2.3) a relatively large portion of alerts 
generated by a sensor can be considered as background noise of the operational 
system. However, the division into true and false positives is not always so black 
and white. The origins of the problem can be coarsely divided into three. 1) Regardless 
of audit source, the audit data usually does not contain all required technical 
information, such as the topology and the vulnerability status for the monitored 
system for correct diagnosis. 2) The non-technical contextual factors, such as op- 
erator’s task and the mission of the monitored system, have an effect on which 
types of alerts are of high priority and relevant. 3) Depending on the context of 
the event, it can be malicious or not, and part of this information can not be ac- 
quired by automated tools or inferred from the isolated events. For the first case, 
think of a Snort sensor that does not know if the attack destination is running 
a vulnerable version of certain OS or server and consequently can not diagnose 
whether it should issue an alert with very precise prerequisites for success. An 
example of the second is a comparison of an on-line business and a military base. At 
the former the operator is likely to assign high priority on the availability of the 
company web server, and he might easily discard port scans as a minor nuisance. 
At the latter the operator may have only minor interest towards the availability 
of the base web server hosting some PR material, but reconnaissance done by 
scanning can be considered as activity warranting user notification. Instead of 
high priority attacks, the third case involves action considered only as potentially 
harmful activity, such as ICMP and SNMP messages that indicate information 
gathering or network problems, malicious as well as innocuous as part of normal 
operation of the network. Here the context of the event makes the difference, 
one event alone is not interesting, but having a thousand or ten events instead of 
the normal average of a hundred in a time interval can be an interesting change 
and this difference can not be encoded into a signature used by a pattern matching 
sensor. 

This kind of activity can be seen to be in the gray area, and the resulting 
alerts somewhere between false and true positive. Typically the operator can 
not afford to monitor it as such because of the sheer amount of events. The 
current work on correlation is largely focusing on how to pick out the attacks 
having an impact on monitored system and show all related events in one attack 
report to the operator. Depending on the approach, the rest of the alerts are 
assigned such a low priority that they do not reach the alert console [3], or 
they can be filtered out before entering the correlation process [6,4]. However, 
if the signature reacting to gray area events is turned on, the operator has some 
interest towards them. Therefore it is not always the best solution to only dismiss 
these less important alerts despite their large number. Monitoring aggregated 
flows can provide information about the monitored system’s state not available 
in individual alerts, and with a smaller load on operator. Our work focuses on 
providing this type of contextual information to the user. 

1.3 Objectives 

We want to examine the possibility to track the behavior of alert flows and the 
goals for this work are the following: 

1. Highlight abnormalities. The primary interest was to detect interesting 
artifacts from high volume alert flows. Even though traffic considered only 
harmful is so commonplace that operators can not usually afford to monitor 
it as such, by focusing on abnormal changes, the burden would be smaller. 

2. Reduce the number of alerts handled by the operator while retain- 
ing the information source. We would also like to point out only the inter- 
esting artifacts. Given the first objective, a significant alert reduction would 
be required. If achieved, it would be feasible to keep these signatures acti- 
vated in the system, providing the operator the desired information through 
the aggregates. A sufficiently low artifact output would also allow the use 
of this method in parallel with other correlation components for additional 
information, regardless how the correlation engine considers the alerts. 
3. Measure the reduction. We do not know what we do not measure. To see 
the effect of the techniques some metrics are needed. This goal is essentially 
linked to the next one, the determination of the applicability. It is also im- 
portant for the operator to know how much is being suppressed for better 
situation understanding. 

4. Determine the applicability of the approach with different alert 
flows. The applicability of the method for processing high volume alert flows, 
and more specifically with different types of alert flows, needs to be defined. 

5. Trend visualization. The power of human visual processing capability has 
been found useful also in intrusion detection, especially to detect anomalies 
or patterns difficult to handle for an AI [7]. To improve operator’s view of 
current system state and the nature of anomalies, alert flow and trend ought 
to be visualized. 

A variation of the EWMA control charts is proposed to achieve these goals. 
This control chart was designed and has been used to process alerts created and 
logged by sensors deployed in a production system at France Telecom. 

The rest of the paper describes our findings. Sect. 2 presents the EWMA 
control charts and our variation. Section 3 describes practical experimentation 
and proposes a new metric for evaluating our approach. Related work is viewed 
in Sect. 4 and we offer our conclusions in Sect. 5. 

2 EWMA Control Charts 

In this section we briefly present the mathematical background of EWMA, its use 
in the control chart procedure, and then our variation for noise monitoring. 

2.1 Backgrounds in Statistical Process Control 

EWMA control charts were originally developed for statistical process control 
(SPC) by Roberts [8], who used the term geometric moving averages instead 
of EWMA, and since then the chart and especially the exponentially weighted 
moving average have been used in various contexts, such as economic applications 
and intrusion detection [9-11]. More details are available in Sect. 4. 

In SPC a manufacturing process is seen as a measurable entity with a dis- 
tribution. The over-all quality of the product resulting from the process is con- 
sidered dependent on the process mean, which is to be kept at the fixed level 
and the variations as small as possible. An EWMA control chart can be used 
to monitor the process mean by maintaining an exponentially weighted moving average of 
the process value. The average is compared to preset control limits, defining the 
acceptable range of values. Next we describe this procedure in more detail. 

The exponentially weighted moving average is defined as 

$$z_i = (1 - \lambda) z_{i-1} + \lambda x_i , \qquad (1)$$

where $0 < \lambda < 1$. Here $z_i$ is the current value of the exponentially smoothed average, 
$z_{i-1}$ the previous smoothed value, and $x_i$ is the current value of the monitored 
statistic. The name exponential smoothing is also used for this type of averaging. 
This recursive formulation distinguishes EWMA from the basic moving 
averages, such as simple moving average or weighted moving average. Exponential 
smoothing takes all past data into account with significance decaying 
exponentially as a function of time. However, at the same time only the previous 
smoothed value and current measure are required to compute the new smoothed 
value. The decay is controlled by the factor $\lambda$, and $(1 - \lambda)$ is called the smoothing 
factor. The name becomes more apparent by rewriting (1) as 

$$z_i = \lambda x_i + \lambda(1-\lambda) x_{i-1} + \lambda(1-\lambda)^2 x_{i-2} + \ldots + \lambda(1-\lambda)^{i-2} x_2 + \lambda(1-\lambda)^{i-1} x_1 + (1-\lambda)^i x_0 , \qquad (2)$$

where $i > 0$. Now it can be seen that the current data from time instant $i$ 
receives weight $\lambda$ and old data from instant $i - j$ receives weight $\lambda(1-\lambda)^j$. If 
the interest is in the long-term trend, large smoothing factors should be used 
and vice versa. 
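The equivalence between the recursive form and the explicit weighted sum can be checked numerically. The sketch below assumes the recursion $z_i = (1-\lambda) z_{i-1} + \lambda x_i$ started from $z_0 = x_0$; the function names and the sample series are hypothetical.

```python
def ewma(xs, lam):
    """Recursive form: z_0 = x_0, then z_i = (1 - lam) * z_{i-1} + lam * x_i."""
    z = xs[0]
    smoothed = [z]
    for x in xs[1:]:
        z = (1 - lam) * z + lam * x
        smoothed.append(z)
    return smoothed


def ewma_expanded(xs, lam):
    """Last smoothed value written as an explicit weighted sum of all observations."""
    i = len(xs) - 1
    total = (1 - lam) ** i * xs[0]                 # weight of the initial value x_0
    for j in range(i):
        total += lam * (1 - lam) ** j * xs[i - j]  # weight lam * (1 - lam)^j for x_{i-j}
    return total


# Hypothetical hourly alert counts.
counts = [100, 110, 90, 400, 105, 95]
recursive_last = ewma(counts, lam=0.2)[-1]
expanded_last = ewma_expanded(counts, lam=0.2)
```

Only the previous smoothed value has to be stored in the recursive form, whereas the expanded form makes the geometric decay of the weights explicit.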

As the monitored statistic in (1) is the process mean, $x_i$ is the average of a subgroup 
of $n$ samples taken at time instant $i$. The standard deviation for $z$ can be obtained 
with the equation 

$$\sigma_z = \sqrt{\frac{\lambda}{2 - \lambda}} \, \sigma_{\bar{x}} . \qquad (3)$$

Since $x_i$ is an average of $n$ samples, $\sigma_{\bar{x}}$ is $\sigma_x / \sqrt{n}$, where $\sigma_x$ is the standard 
deviation of $x$, supposed to be known a priori. 

The upper and lower control limits (UCL, LCL) set as 

$$x_0 \pm 3\sigma_z \qquad (4)$$

define the interval for $z$ where the process is deemed to be under control. Here $x_0$ is 
the nominal average of the process, also supposed to be known a priori. For each 
new measurement, the current value of statistic $z$ is calculated by using (1), and 
if the control limits are surpassed, instability is signaled. 

Exponential smoothing can be approximated by a standard moving average 
with a window size n. According to Roberts [8], for a given A a roughly equivalent 
window size n is determined from 

$$n = \frac{2}{\lambda} - 1 . \qquad (5)$$

With this equation the decay speed becomes more intuitive knowing the mea- 
surement interval length. We call the product of n and the sampling interval 
length for x the memory of the statistic, as events older than that have only 
little significance on current z. For example, smoothing factor 0.92 would trans- 
late to a window size 24, and for 0.80 corresponding n is 9. This shows also how 
larger smoothing factor corresponds to averaging over larger number of samples. 
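The two examples above can be reproduced with a short calculation, assuming the relation $n = 2/\lambda - 1$ with the smoothing factor defined as $(1 - \lambda)$. The function names and the one-hour sampling interval are hypothetical.

```python
def equivalent_window(smoothing_factor):
    """Roughly equivalent moving-average window size for smoothing factor (1 - lambda)."""
    lam = 1.0 - smoothing_factor
    return 2.0 / lam - 1.0


def memory(smoothing_factor, interval_length):
    """Memory of the statistic: window size times the sampling interval length."""
    return equivalent_window(smoothing_factor) * interval_length


window_092 = equivalent_window(0.92)  # about 24 samples
window_080 = equivalent_window(0.80)  # about 9 samples
memory_hours = memory(0.92, 1.0)      # about 24 hours with hourly intervals
```

With hourly measurements, a smoothing factor of 0.92 therefore gives the trend a memory of roughly one day.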

2.2 The Control Chart for Alert Flows 

Our needs differ considerably from those of Roberts, and also to a smaller degree 
from those of the related work in intrusion detection (Sect. 4). Below our 
variation of the technique is described, building largely on [11], and the rationale for 
changes and choices is provided. 

The monitored measure is the alert intensity of a flow x, the number of 
alerts per time interval. One alert flow consists typically of alerts generated by 
one signature, but also other flows, such as alerts generated by a whole class of 
signatures, were used. Intensity x is used to form the EWMA statistic of (1). 
This statistic is called the trend at time i. 

It is quite impossible to define a nominal average as the test baseline $x_0$ 
for (4), since these flows evolve significantly with time. Like Mahadik et al. [11], to 
accommodate the dynamic, non-stationary nature of the flows, we allow the test baseline 
to adapt to changes in alert flow, and the control limits for time instant 
$i$ are 

$$z_{i-1} \pm n \cdot \sigma_{z_{i-1}} . \qquad (6)$$

Here $n$ is a factor expressing how large a deviation from trend is acceptable and 
$\sigma_{z_{i-1}}$ is $z$’s standard deviation at interval $i - 1$. The control limits for interval $i$ 
are thus calculated using trend statistics from interval $i - 1$. 

To obtain the standard deviation $\sigma_z$, another EWMA statistic 

$$z^2_i = (1 - \lambda) z^2_{i-1} + \lambda x_i^2 \qquad (7)$$

is maintained, where $x_i$ is the current intensity as in trend calculation. The 
standard deviation is computed as 

$$\sigma_{z_i} = \sqrt{z^2_i - (z_i)^2} . \qquad (8)$$

Now for each new interval and for each alert flow, 1) the alert intensity 
is measured, 2) the control limits are calculated, and 3) the decision whether 
the interval is abnormal or not is taken. Both [9] and [11] test smoothed event 
intensity against the control limits. They use a larger smoothing factor with (1) 
to obtain the baseline $z_i$ from (6) and apply (1) with a smaller smoothing factor 
to have a smoothed event intensity that is tested against the control limits. This is 
done to reduce the effect of wild values in the observations. However, in the case 
of alerts, these outliers are usually of interest for the operator and testing the 
raw intensity, or in other words using $(1 - \lambda) = 0$ to obtain the value that is 
tested against control limits, gave us better capability to capture small variations 
occurring in some extremely stable alert flows. 
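The per-interval procedure can be sketched as follows, assuming the trend and the EWMA of the squared intensity are both updated with the same $\lambda$, and that the raw intensity of interval $i$ is tested against limits computed from interval $i - 1$. The function name, the default values (smoothing factor 0.92, $n = 3$), the warm-up period during which nothing is flagged, and the synthetic flow are hypothetical choices for illustration and not part of the described tool.

```python
import math


def flag_abnormal_intervals(intensities, smoothing=0.92, n=3.0, warmup=24):
    """Return one boolean per interval: True if the raw intensity leaves the limits."""
    lam = 1.0 - smoothing
    z = float(intensities[0])  # trend: EWMA of the intensity
    z2 = z * z                 # EWMA of the squared intensity
    flags = [False]
    for i, x in enumerate(intensities[1:], start=1):
        sigma = math.sqrt(max(z2 - z * z, 0.0))      # deviation from interval i - 1
        lower, upper = z - n * sigma, z + n * sigma  # control limits for interval i
        flags.append(i >= warmup and not (lower <= x <= upper))
        z = (1 - lam) * z + lam * x                  # update trend
        z2 = (1 - lam) * z2 + lam * x * x            # update second moment
    return flags


# Synthetic flow: a stable alternation around 100 alerts per interval with one burst.
flow = [105 if i % 2 else 95 for i in range(100)]
flow[60] = 500
flags = flag_abnormal_intervals(flow)
```

Because the burst is also fed into both EWMA statistics, the limits widen right after it and then contract again as its weight decays, which is the adaptive behavior intended for non-stationary flows.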

2.3 Learning Data 

The tool was developed for an IDS consisting of Snort sensors logging alerts 
into a relational database. The sensors are deployed in a production network, 
one closer to the Internet and two others in more protected zones. This adds to the 
difficulty of measuring and testing, since we do not know the true nature of 
traffic that was monitored. On the other hand, we expect the real world data to 
contain such background noise and operational issues that would not be easily 
incorporated to simulated traffic. 
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Table 1. Five most prolific signatures in the first data set 



signature name 


number of alerts 


proportion 


SNMP Request udp 


176 009 


30% 


ICMP PING WhatsupGoId Windows 


72 427 


13% 


ICMP Destination Unreachable (Comm 
Adm Proh) 


57 420 


10% 


LOCAL-POLICY External connexion from 
HTTP server 


51 674 


9% 


ICMP PING Speedera 


32 961 


6% 


sum 


390 491 


68% 



The data set available to us in this phase contained over 500 K alerts accumulated
during 42 days. Of the 315 activated signatures, only five were responsible
for 68 % of the alerts, as indicated in Table 1, and we chose them for further scrutiny.
To give an idea of the alert flow behavior, examples of alert generation intensities
for four of these signatures are depicted in Fig. 1, and the fifth, ICMP PING
WhatsupGold Windows, is visible in Figs. 2 and 3 (described in more detail
in Sect. 2.4). The relatively benign nature of these alerts and their high volume
were among the original inspirations for this work. These alerts are good examples
of problem three discussed in Sect. 1.2 and demonstrate why we
opt not to simply filter out even the more deterministic components. For example,
the alert flow in Fig. 1(c), triggered by SNMP traffic over UDP, had only a few
(source, destination) address pairs, and the constant component could easily be
filtered out. However, this would deprive the operator of being notified of behavior
such as the large peak and the shift in the constant component around February 15th,
as well as the notches at the end of February and on March 15th. These are not
necessarily intrusions, but at least artifacts worth further investigation. On the other
hand, we do not want to distract the operator with the alerts created during
the hours that represent the stable situation with constant intensity. As for the
others, Fig. 1(a) shows alerts from a user-defined signature reacting to external
connections from an HTTP server. The alerts occur in bursts as large as several
thousand during one hour, and the intensity profile resembles an impulse train. As
the signature is custom made, the operator likely has some interest in this activity.
In Figs. 1(b) and 1(d) we have alerts triggered by two different ICMP Echo messages,
the former being remarkably more regular than the latter. In the system in question,
deactivation of ICMP related signatures was not seen as a solution by the operator, as
they are useful for troubleshooting problems. Consequently, we had several high
volume alert flows for which suppression was not the first option.

2.4 Deploying the Control Chart 

The behavior of the model defined in Sect. 2.2 was explored with the five flows made
up from these five signatures (Figs. 1, 2, and 3) by varying several parameters.
We searched for a combination that would 1) catch the desired artifacts from the
alert flow and 2) create as few new alerts as possible. Not having
an exact definition of interesting artifacts from the real users, we had to look
for behavior that seemed to us worth further investigation. In addition to the
actual parameters, different aggregation criteria and input preprocessing
were also tried.

Fig. 1. Hourly alert intensity for some of the most prolific signatures in learning data.
Horizontal axis is the time, and the vertical shows the number of alerts per hour

Setting chart parameters. The width of control limits in (6) was set to
three standard deviations as already proposed by Roberts [8]. Values {1, 2, 3, 6}
were tried before making the choice. The memory of the chart depends on the
smoothing factor and the sampling interval length. Figures 2 and 3 depict the effect
of memory length on the trend and control limits with a sampling interval of
one hour and $(1-\lambda)$ with values 0.8 and 0.99407, respectively. The smaller
smoothing factor results in the trend and control limits following the current value
closely. The difference between the trend and reality is nearly invisible in
Fig. 2, and the control limits tighten relatively fast after an abrupt shift in flow
intensity. The model behavior with a significantly larger smoothing factor in Fig. 3
shows how the recent values have a relatively small effect on the trend. The standard
deviation reaches such large values that the control limits absorb all variations
in the flow. For $(1-\lambda)$ in [0.2, 0.8] the flagging rate increased towards smaller
smoothing factors, the steepness of the increase varying from flow to flow.




Fig. 2. The effect of small smoothing factor on trend and control limits (axis label: alert intensity)

Fig. 3. The effect of large smoothing factor on trend and control limits (axis label: alert intensity)

However, the sampling interval length had surprisingly little effect on the
proportion of both intervals and alerts deemed anomalous. The same applies
to the smoothing factor, apart from the very extreme values. Figures 4
and 5 show the proportion of anomalous alerts for two signatures triggered by
ICMP Echo messages as a function of the smoothing factor, where
$(1-\lambda) \in [0.8, 0.99407]$, for sampling intervals of {0.5, 1, 2, 4} hours. For both,
the proportion of alerts marked as anomalous stays within a range of four percentage
points, except with the largest smoothing factors. In Fig. 4 alert flagging shoots up
with the largest smoothing factors, a phenomenon caused by the large difference
between the trend and the current value due to the lagging trend. In Fig. 5 there is
a sharp drop in flagging as the smoothing factor increases. This inverse effect was
usually related to wide control limits causing all flow behavior to be considered
normal. An example of this kind of situation is visible in the right half of Fig. 3.

Setting the sampling interval to one hour and using smoothing factors 0.8
and 0.92 allowed flagging the kinds of anomalies in alert flows that were also
considered interesting in the visual exploration shown in Figs. 2 and 3. As stated
above, the sampling interval length seemed to have only a minor effect on the
flagging rate. In addition, according to (5), this gives the model a memory of 9 and
24 hours, respectively. For the user this provides an intuitive association with a
workday and a day, which is also an important aspect to consider. One hour from event
to notification is a long time in terms of intrusion detection. However, instead
of real-time detection of compromise, the need was to heavily summarize the
background noise, and in addition our target IDS is not under constant surveillance
by the operator. As such, the one hour sampling interval was considered
suitable. Depending on the user's requirements, the sampling frequency could
be increased.
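As a quick check of the memory lengths quoted above, the following sketch assumes that the memory $N$ and the weight $\lambda$ are related by the common rule $\lambda = 2/(N+1)$, that is $N = 2/\lambda - 1$. Whether (5) has exactly this form is an assumption on our part; it does reproduce the 9 and 24 hour figures. The function name `memory_length` is hypothetical.

```python
def memory_length(smoothing: float) -> float:
    """Approximate chart memory in sampling intervals.

    Assumption: lambda = 1 - smoothing and lambda = 2 / (N + 1),
    hence N = 2 / lambda - 1.
    """
    lam = 1.0 - smoothing
    return 2.0 / lam - 1.0


for s in (0.80, 0.92, 0.99407):
    print(f"smoothing factor {s}: about {memory_length(s):.0f} intervals")
```

With a one hour sampling interval, this assumed relation gives about 9 and 24 hours for the two chosen factors, and about 336 hours (two weeks) for 0.99407.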

To capture the time related behavior visible for some signatures, two other
models were deployed in addition to the one monitoring the alert flow in a
continuous manner. The second model, later referred to as daily (or hourly), uses a
separate statistic for the intensity of each hour of the day. The third model,
weekday, maintains separate statistics for weekday and weekend intensities.
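One possible way to organize the three models is to keep one set of EWMA statistics per time slot and to pick the slot from the timestamp of the interval. The sketch below is illustrative only; `Ewma`, `SlottedMonitor` and the `slot_*` functions are hypothetical names, the statistics are seeded with the first observation of each slot, and the small constant added to the limits is an assumption to avoid flags caused by rounding.

```python
import math
from datetime import datetime, timedelta


class Ewma:
    """Minimal EWMA trend and deviation tracker for one time slot."""

    def __init__(self, smoothing=0.92, n=3.0):
        self.smoothing, self.n = smoothing, n
        self.z = self.z_sq = None

    def update(self, x):
        if self.z is None:                 # assumption: seed with first value
            self.z, self.z_sq = float(x), float(x) * x
            return False
        sigma = math.sqrt(max(self.z_sq - self.z ** 2, 0.0))
        anomalous = abs(x - self.z) > self.n * sigma + 1e-9
        lam = 1.0 - self.smoothing
        self.z = self.smoothing * self.z + lam * x
        self.z_sq = self.smoothing * self.z_sq + lam * x * x
        return anomalous


def slot_continuous(ts):
    return "all"                           # one statistic for every interval


def slot_daily(ts):
    return ts.hour                         # one statistic per hour of the day


def slot_weekday(ts):
    return "weekend" if ts.weekday() >= 5 else "weekday"


class SlottedMonitor:
    """Routes each interval to the statistics of its time slot."""

    def __init__(self, slot_of, smoothing=0.92):
        self.slot_of, self.smoothing, self.charts = slot_of, smoothing, {}

    def update(self, ts, intensity):
        slot = self.slot_of(ts)
        if slot not in self.charts:
            self.charts[slot] = Ewma(self.smoothing)
        return self.charts[slot].update(intensity)


# Made-up flow: 100 alerts per hour in office hours, 10 otherwise.
start = datetime(2004, 1, 5)
daily, continuous = SlottedMonitor(slot_daily), SlottedMonitor(slot_continuous)
daily_flags = continuous_flags = 0
for h in range(24 * 14):
    ts = start + timedelta(hours=h)
    x = 100 if 8 <= ts.hour < 18 else 10
    daily_flags += daily.update(ts, x)
    continuous_flags += continuous.update(ts, x)
print(daily_flags, continuous_flags)
```

For this strictly periodic made-up flow the daily model raises no flags, since every hour-of-day slot sees a constant intensity, while the continuous model flags at least the first jump to the office-hour level.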

The aggregation criteria. Combining different signatures into one flow was also
tried. For example, Snort has several signatures for different ICMP Destination
Unreachable messages and for web traffic, and these were used to form two
aggregates, respectively. In the learning data, 232 web related signatures were
activated, and only those Destination Unreachable signatures reacting to
Communication Administratively Prohibited events (Snort SIDs 485, 486, 487) were
present. These combinations did not seem to make sense, since a few signatures
dominated the sums, which then only reflected the behavior of the alerts caused by
those signatures. Only separate signatures and signature classes were chosen for
further examination. However, the aggregate flows could be formed in many different
ways; refer to Sect. 5 for more on this.
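To illustrate how the choice of aggregation criterion turns alerts into flows, the sketch below counts alerts per hour for an arbitrary flow key. The record layout (a dict with `timestamp`, `signature` and `classification` fields) and the function name `hourly_intensities` are hypothetical and do not reflect the schema of the alert database used here; the sample records are made up.

```python
from collections import Counter
from datetime import datetime


def hourly_intensities(alerts, key=lambda a: a["signature"]):
    """Count alerts per flow and per hour.

    Returns {flow_key: Counter({hour_start: number_of_alerts})}.
    Hours without alerts are absent and must be treated as zero intensity.
    """
    flows = {}
    for alert in alerts:
        hour = alert["timestamp"].replace(minute=0, second=0, microsecond=0)
        flows.setdefault(key(alert), Counter())[hour] += 1
    return flows


# Made-up sample records (field names and values are illustrative only).
alerts = [
    {"timestamp": datetime(2004, 2, 15, 10, 5),
     "signature": "SNMP request udp", "classification": "class-a"},
    {"timestamp": datetime(2004, 2, 15, 10, 40),
     "signature": "SNMP request udp", "classification": "class-a"},
    {"timestamp": datetime(2004, 2, 15, 11, 1),
     "signature": "ICMP PING speedera", "classification": "class-b"},
]

by_signature = hourly_intensities(alerts)
by_class = hourly_intensities(alerts, key=lambda a: a["classification"])
print(by_signature["SNMP request udp"][datetime(2004, 2, 15, 10)])
```

Changing only the `key` function switches between per-signature flows, class aggregates, or any other grouping, which keeps the monitoring step independent of the aggregation criterion.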

Input processing. Instead of using the measured value of event intensity as
$x_i$ in (1), an additional smoothing operation with a small smoothing factor is done
in [9]. We experimented with smoothing factors 0.2 and 0.3 for the trend input,
but the effect on alert reduction was negligible. By feeding the raw intensity
to trend calculations, $\sigma_z$ reaches higher values more quickly. This also means
that the control limits widen rapidly, which helps to avoid flagging several
intervals after an abrupt shift in intensity level. For example, in Fig. 1(c), as the
intensity increases around February 15th, the trend line lags below the real value
for a moment. If the control limits are not wide enough, several intervals become
flagged instead of only the one containing the shift.

Mostly out of curiosity, cutoff values for $x_i$ based on $\sigma_z$ to limit trend updates
were also used. This did not work well, and caused problems especially with
stable flows. When $\sigma_z$ approached zero, the trend became too slow in adapting to
drastic changes. Again, the above mentioned example with SNMP Request udp
applies.

In order to validate the choice of parameters and to see the suitability of our 
approach, experimentation with a larger data set was performed. 

3 Practical Experimentation 

For testing, a more complete alert database dump than the one used in the learning
phase was obtained. It contains nearly 2 M alerts from 1836 signatures accumulated
during 112 days from the same system. We use two statistics to measure the
applicability of EWMA monitoring to a particular alert flow. The statistics
are the percentage of alerts flagged and the proportion of anomalous intervals among
intervals with non-zero activity, discussed in Sect. 3.1. For each flow these
measurements are made with the three models, continuous, daily and weekday, and with
two different smoothing factors, resulting in six statistics per flow. These results
are presented in Sect. 3.2. Then the overall usefulness of our method for summarizing
the alerts is evaluated with respect to the number of applicable and non-applicable
flows.



3.1 The Rationale for the Test Metric 

As noted by Mell et al. [12], testing IDSes is no simple task and lacks rigorous
methodology. The difficulty in our situation arises from the fact that we intend
to signal the user when something odd is going on with the background noise,
and to help him cope with alert floods by reducing the number of alerts reported.
As such we cannot make a strict separation between correct and false alarms,
and using metrics like accuracy or completeness [13] is difficult. In the end, it is
the user who decides whether the extracted information is useful or not.

However, we can describe the summarizing capabilities with the two above
mentioned statistics, 1) the proportion of alerts flagged as anomalous, and 2) the
proportion of busy time slots. As the control chart signals only the abnormality
of an interval, we count all alerts from an anomalous interval as flagged, resulting
in a rough approximation of the reduction in individual alerts. Since we
intend to summarize background noise, it is unlikely that the operator would go
through individual alerts even from an anomalous interval. Let us assume that
he uses a time unit $t$ to skim through the raw background noise generated by
one flow during an interval $T$ to detect that it seems normal, where $T$ equals
the sampling interval of EWMA monitoring. In our case $T$ is one hour and it is
likely that $t \ll T$. If the operator uses EWMA monitoring for the flow, he will
be notified only when the flow behaves in an anomalous way. Now the time units $t$ that
would be used to manually detect normal behavior can be used for more useful
tasks, like treating the more severe alerts. Thus it is more interesting to look at
the number of anomalous intervals after summarization with respect to the intervals
showing activity in the raw flow than at the alert reduction alone.

The proportion of busy time slots is obtained by dividing the number of
anomalous intervals by the number of intervals showing non-zero activity for the
flow. The proportion indicates how constantly the operator would be bothered
by the flow with EWMA monitoring compared to manually checking the accumulated
noise every $T$. Small values for a flow mean a smaller nuisance to the user,
whereas large values tell that EWMA monitoring is not capable of summarizing
the activity of this flow.

As an anomaly can be caused by an interval with zero alerts, the proportion
could in theory be above unity. For example, imagine a flow with an impulse train
type profile, such as LOCAL-POLICY in Fig. 1(a), for which all active intervals
plus some zero intensity intervals could be flagged. However, in practice we never
saw this. For the daily and weekday models we combine the results of the individual
time slot statistics to obtain the overall performance. One drawback of these
metrics is that there is no cost associated with them, even though some information
is lost when viewing only the flow anomaly alerts instead of the individual alerts.
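The two statistics can be computed from the per-interval intensities and the per-interval anomaly flags as in the following sketch. The function name `summarization_metrics` is hypothetical and the numbers in the usage example are made up; the example includes a flagged interval with zero alerts to show why the second proportion is not bounded by one in theory.

```python
def summarization_metrics(intensities, flags):
    """Return (proportion of alerts flagged, proportion of busy time slots).

    intensities: number of alerts in each interval
    flags:       True where the control chart marked the interval anomalous
    """
    total_alerts = sum(intensities)
    active_intervals = sum(1 for x in intensities if x > 0)
    # All alerts of an anomalous interval are counted as flagged.
    flagged_alerts = sum(x for x, f in zip(intensities, flags) if f)
    # Flagged intervals are counted even when they contain no alerts.
    flagged_intervals = sum(1 for f in flags if f)
    alert_share = flagged_alerts / total_alerts if total_alerts else 0.0
    busy_share = flagged_intervals / active_intervals if active_intervals else 0.0
    return alert_share, busy_share


# Made-up example: five intervals, two flagged, one of them with zero alerts.
print(summarization_metrics([10, 10, 0, 50, 10], [False, False, True, True, False]))
```

Here 50 of the 80 alerts fall into flagged intervals, and two intervals are flagged against four active ones.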



3.2 Results 

As the interest is to monitor high volume aggregates, we considered only
signatures that had created more than 100 alerts. After this preselection we had 85
signatures left. This section first describes how the flow volume affected the
summarization, and the identified reasons for poor efficiency. Secondly, we analyze
the alert types causing large numbers of alerts, the impact of the time slot choice
and of larger aggregates on the flow behavior, and the stability of flow profiles.



Effect of flow volume. Judging from the busy interval reduction, the method
is useful for alert flows that had created more than 10 K alerts, the effectiveness
increasing with the flow volume. The busy interval reduction for flows below 10 K
alerts is already more modest, and below 1 K alerts the reduction is relatively
negligible. Tables 2 and 3 depict, respectively, the percentage of non-zero
intervals and of alerts flagged as anomalous, due to space constraints only
for flows of over 10 K alerts. Results are shown with smoothing factors 0.80
and 0.92 for each of the three models, continuous, daily, and weekday. Table 2 also
shows the total number of active intervals, and Table 3 the total number of alerts,
for each flow.

Table 2. The proportion (%) of flagged intervals from intervals showing activity for the
flow with different models and smoothing factors

                                                       cont.       daily       weekd.
flow                                          int.   .80   .92   .80   .92   .80   .92
Known DDOS Stacheldraht infection              563   1.6   1.8   8.9   8.5   2.0   2.5
SNMP request udp                              2311   4.3   2.9   5.8   4.6   4.2   3.0
ICMP PING WhatsupGold Windows                 2069   5.1   3.3   5.8   2.6   5.1   3.2
DDOS Stacheldraht agent->handler (skillz)      512   1.2   1.6    12    16   1.8   2.1
ICMP Dst Unr (Comm Adm Proh)                  2578   5.4   3.5   6.7   5.8   5.4   3.4
ICMP PING speedera                            2456   3.3   1.7   4.2   2.9   3.3   0.9
WEB-IIS view source via translate header      2548   5.2   3.8   6.4   5.7   5.1   4.0
WEB-PHP content-disposition                   2287   6.8   4.3   7.7   5.2   6.7   4.0
SQL Sapphire Worm (incoming)                  1721   2.2   1.2   4.9   3.5   2.4   1.6
(spp_rpc_decode) Frag RPC Records              421    13   7.8    20    20    12   9.0
(spp_rpc_decode) Incompl RPC segment           276    21    13    27    27    22    13
BAD TRAFFIC bad frag bits                      432    34    23    37    33    35    22
LOCAL-WEB-IIS Nimda.A attempt                  537    24    16    30    25    24    16
LOCAL-WEB-IIS CodeRed II attempt              1229   6.3   4.6    14    14   6.9   5.3
DNS zone transfer                              855   9.7   6.7    13    10   9.8   6.5
ICMP L3retriever Ping                          107    29    26    71    70    28    23
WEB-MISC http directory traversal              708    12   9.3    15    13    12   9.5
(spp_stream4) STLTH ACT (SYN FIN scan)          29    65    58    82    79    62    62

Table 3. The percentage of flagged alerts with different models and smoothing factors

                                                         cont.       daily       weekd.
flow                                          alerts   .80   .92   .80   .92   .80   .92
Known DDOS Stacheldraht infection             308548   1.2   1.2   4.4   8.4   1.4   1.5
SNMP request udp                              303201   4.4   3.0   4.9   4.4   4.2   3.2
ICMP PING WhatsupGold Windows                 297437   5.4   4.0   4.5   2.9   5.2   3.1
DDOS Stacheldraht agent->handler (skillz)     280685   0.8   1.0   7.3   7.0   1.2   1.2
ICMP Dst Unr (Comm Adm Proh)                  183020    32    28    39    37    32    28
ICMP PING speedera                             95850   5.5   3.1   2.5   2.3   5.3   1.4
WEB-IIS view source via translate header       58600    25    21    12    11    24    22
WEB-PHP content-disposition                    48423    18    14    15    13    18    14
SQL Sapphire Worm (incoming)                   38905   3.0   1.9    11   9.1   3.1   2.5
(spp_rpc_decode) Frag RPC Records              38804    63    62    94    93    63    62
(spp_rpc_decode) Incompl RPC segment           28715    64    62    93    93    64    62
BAD TRAFFIC bad frag bits                      27203    51    42    57    54    53    42
LOCAL-WEB-IIS Nimda.A attempt                  25038    65    61    69    64    64    62
LOCAL-WEB-IIS CodeRed II attempt               20418    11   7.5    17    22    11   7.1
DNS zone transfer                              15575    32    35    55    55    32    36
ICMP L3retriever Ping                          12908    11    12    90    90    11    12
WEB-MISC http directory traversal              10620    41    38    46    45    41    38
(spp_stream4) STLTH ACT (SYN FIN scan)         10182    96    90    93    93    96    96

Table 4. All 85 flows grouped by the number of alerts created and the percentage level
below which busy intervals or alerts were flagged

       busy interval reduction                  alert reduction
alerts      5%   10%   50%  100%      alerts      5%   10%   50%  100%
> 100 K      5     0     0     0      > 100 K      4     0     1     0
> 10 K       5     3     4     1      > 10 K       2     1     6     4
> 1 K        0     4    19     7      > 1 K        0     1    15    14
> 100        0     1    12    24      > 100        0     0     8    29
sum         10     8    35    32      sum          6     2    30    47

Table 4 summarizes the reduction results with the continuous model and smoothing
factor 0.92. All 85 flows are grouped into four classes according to both their
output volume (over 100, 1 K, 10 K or 100 K alerts) and the achieved reduction
in busy intervals and alerts (below 5%, 10%, 50% or 100% of the original),
respectively. These results also show the poorer performance for flows below the 10 K
limit. The busy intervals show a more consistent relation between the volume and
the reduction. On the right hand side of Table 4, in the class over 100 K alerts, ICMP
Dst Unr (Comm Adm Proh) stands out with a reduction significantly smaller
than that of the others in the same class. We found two explanations for this behavior.
First, there was one large alert impulse of approximately 17 K alerts flagged in
the test data. This makes up roughly 10 % of flagged alerts. Second, the nature of
the flow is more random compared to the others; this is visible in Fig. 1(d) for the
learning data and applies also to the larger data set. This randomness causes more
alert flagging, but the reduction in busy intervals is still comparable to other flows
in this volume class.

Reasons for poor summarization. There seem to be two main reasons for
poorer performance. 1) Many flows had a few huge alert peaks that increase the
alert flagging significantly. 2) The intensity profile has the form of an impulse
train, which has a negative impact on the reduction of both alerts and busy intervals.
As the first cause does not remarkably increase the number of reported anomalous
intervals, i.e. the number of times the user is disturbed, it is the smaller problem.
However, the second cause renders our approach rather impractical for monitoring
such a flow, as the operator is notified on most intervals showing activity.
The flow (spp_stream4) on the last row of Tables 2 and 3 is a typical example,
as its alert profile consisted only of impulses. In such a situation a large majority
of active intervals are flagged as anomalous. A closer look at the alert impulses
revealed that they were usually generated in such a short time interval that
increasing the sampling frequency would not help much. Instead, other means
should be considered to process them.

Represented alert types. Amongst the most prolific signatures, we can identify
three main types of activity: hostile activity, information gathering, and alerts that
can be seen to reflect the dynamics of networks.

Hostile activity is represented by DDoS tool traffic and worms with five
signatures. The two DDoS signatures are actually the same; different names were
used by the operator for alert management reasons. If a busy interval reduction
below 5% with the continuous model and $(1-\lambda) = 0.92$ is used to define EWMA
monitoring as applicable for a flow, then three fourths of the hostile activity is
in the feasible range.

In the system in question, possible information gathering is the most common
culprit for numerous alerts. This category can be further divided into information
gathering on applications (web related signatures) and on network architecture
(ICMP, SNMP and DNS traffic). In both categories, there are both suitable and
unsuitable flows for this type of monitoring.

The ICMP Destination Unreachable (Communication Administratively Prohibited)
message is an example of the activity that describes the dynamics of the
network. It reflects the network state in terms of connectivity, and the origins
and causes of these events are generally out of the operator's control.

Signatures firing on protocol anomalies can be considered as an orthogonal
classification, since they can represent any of the three types above.
(spp_rpc_decode), (spp_stream4) and BAD TRAFFIC were all poorly handled by the
method. Another common factor is the smaller degree of presence in the data
set in terms of non-zero intervals. As (spp_stream4) means possible reconnaissance
and is present on only 29 intervals, it is less likely to be just
background noise.

The nature of these alerts and their volumes in general support the claim
that a large proportion of generated alerts can be considered as noise. Even in the
case of hostile activity the originating events warrant aggregation. This applies
in our case, but the situation may vary with different operating environments.

Table 5 shows the signature flows ordered by their omnipresence, giving the
number of active intervals and the percentage this makes out of the whole testing
interval. A rough division according to the 5 % watershed is made, and a type is
assigned to each signature according to the above discussion. We can see that for all
signatures showing activity on more than 45 % of the intervals the number of
alerts issued to the operator can be significantly reduced in this system.

It would seem that the omnipresence of alerts would be a better criterion than
the alert type for determining whether EWMA monitoring would be useful or
not.

Impact of time slot choice. According to these metrics, the usefulness of the
daily and weekday models was limited to a few exceptions; generally the continuous
model performed as well as the others. We just happened to have
one of the exceptions that really profited from the hourly approach in our early
experimentations, and made the erroneous hypothesis that they were common. The
metrics are, however, limited for this kind of comparison. It is especially difficult
to say whether the hourly approach just marks more intervals as anomalous or
whether it actually captures interesting artifacts differently. On many occasions the
smaller reduction was at least partly due to abrupt intensity shifts, as several
different statistics making up the hourly model each signal an anomaly, whereas the
continuously updated statistic does this only once. The two DDoS flows had
intensity profiles resembling a step function, which caused the hourly model to
flag significantly more alerts than the continuous one. Another factor encumbering
the comparisons is the difference in the effective lengths of the model memories. As
the time slot statistics of the hourly and weekday models are updated with only the
corresponding intensity measures, the values averaged have a longer span in real
time. For example, the hourly model's statistics are affected by measurements that
are 8 or 24 days old.

Table 5. The omnipresence of signatures and their types. Presence measured in active
intervals

signature                                     type        <5%   active   present
ICMP Dst Unr (Comm Adm Proh)                  network     ok      2578       95%
WEB-IIS view source via translate header      info_web    ok      2548       93%
ICMP PING speedera                            info_net    ok      2456       90%
SNMP request udp                              info_net    ok      2311       85%
WEB-PHP content-disposition                   info_web    ok      2287       84%
ICMP PING WhatsupGold Windows                 info_net    ok      2069       76%
SQL Sapphire Worm (incoming)                  hostile     ok      1721       63%
LOCAL-WEB-IIS CodeRed II attempt              hostile     ok      1229       45%
DNS zone transfer                             info_net    no       855       31%
WEB-MISC http directory traversal             info_web    no       708       26%
Known DDOS Stacheldraht infection             hostile     ok       563       20%
LOCAL-WEB-IIS Nimda.A attempt                 hostile     no       537       19%
DDOS Stacheldraht agent->handler (skillz)     hostile     ok       512       18%
BAD TRAFFIC bad frag bits                     proto       no       432       15%
(spp_rpc_decode) Frag RPC Records             proto       no       421       15%
(spp_rpc_decode) Incompl RPC segment          proto       no       276       10%
ICMP L3retriever Ping                         info_net    no       107        3%
(spp_stream4) STLTH ACT (SYN FIN scan)        proto       no        29        1%

Class flows. Grouping signature classes together increased the flagging percentage.
Table 6 shows the reductions obtained with the continuous model and $(1-\lambda) = 0.92$
for class aggregates with more than 1000 alerts. In fact, almost every class
contains one or more voluminous signatures that were statistically problematic
already by themselves, and this affects the behavior of the class aggregate. The
increased flagging could also indicate that anomalies in signature based flows with
smaller volume are detected to some degree. The levels of busy intervals are
reduced relatively well, and again the flagging generally increases as the alert
volume decreases. The aggregation by class might be used to gain even more
abstraction and higher level summaries in alert saturated situations. However, there
are likely to be better criteria for aggregation than the alert classes.



Table 6. The reduction in alerts and busy intervals when aggregating according to
signature classes. Results for continuous model with $(1-\lambda) = 0.92$

                                  raw                anomalous
flow                           int.    alerts      int.    alerts
misc-activity                  2678    618273      1.9%      8.9%
class_none                     1429    380568      4.8%     18.3%
attempted-recon                2635    360613      3.7%      7.0%
known-issue                     563    308548      1.7%      1.1%
web-application-activity       2569     88554      3.3%     16.3%
bad-unknown                    2559     65883      3.7%     20.9%
known-trojan                   1511     46014      5.4%     34.9%
misc-attack                    1727     39070      1.3%      2.1%
web-application-attack         1017      9587      9.1%     40.5%
attempted-user                  272      3694     19.4%     40.6%
attempted-dos                   361      2782     24.3%     67.8%
attempted-admin                 444      1760     20.2%     33.1%

Flow stability. To give an idea of the stability of flow profiles, Table 7
compares the alert and busy interval reduction obtained for four signatures used in
the learning phase against the reduction in the testing data. In general the flagging is
slightly higher in the training data set, but for ICMP Dst Unr (Comm Adm Proh)
significantly more alerts are marked anomalous in the test set. The large alert
impulse in this flow, mentioned earlier, accounts for approximately 14 percentage
points of this increase in the test data. Even if those alerts were removed, the
increase would be large. Still, the reduction in busy intervals is quite similar,
suggesting higher peaks in the test set. The fifth signature, enforcing a local policy
and also viewed in the learning phase, no longer existed in the testing data set. This
signature created alert impulses (see LOCAL-POLICY in Fig. 1(a)) and the alert
reduction was marginal in the learning data.

Table 7. A comparison of results (%) obtained during learning and testing phases.
$(1-\lambda) = 0.92$

                                     alerts           intervals
flow                             learn.   test     learn.   test
SNMP request udp                    2.7    3.5        2.2    3.5
ICMP PING WhatsupGold Windows       4.6    3.6        2.9    3.6
ICMP Dst Unr (Comm Adm Proh)         12     36        3.2    3.7
ICMP PING speedera                  2.8    3.2        1.3    2.0

It seems that with the parameters used the reduction performance stays almost
constant. This would suggest that after setting parameters meeting the
operator's needs, our approach is able to adapt to lesser changes in alert flow
behavior without further adjustment. At least during this test period, none of the
originally well-behaved flows changed to a more problematic impulse-like
profile, nor vice versa. Also, signatures having a constant alert flow or a more random
process type behavior, both feasible for the method, kept their original profile.

To wrap up the results, it seems possible to use this approach to summarize
and monitor the levels of high volume background noise seen by an IDS. Up to
95 % of the one hour time slots showing activity from such an alert flow can
be unburdened from the distraction. For the remaining intervals, instead of a
barrage of alerts, only one alert would be output at the end of the interval. As
both data sets came from the same system, the generality of these observations
is rather limited, and more comprehensive testing would be required for further
validation.

If the user is worried that aggregation at signature level loses too much data,
it is possible to use additional criteria, such as source and destination addresses
and/or ports, to obtain more focused alert streams. The reduced aggregation is
likely to create more flagged intervals, and this is a tradeoff that the user needs
to consider according to his needs and the operating environment. Determining
whether the summarization masked important events in the test set was not possible,
as we do not possess records of actual detected intrusions and problems in the
monitored system against which we could compare our results.



4 Related Work 

The focus of this work was only on volume reduction and alert aggregation,
not on content improvement or activity tracking. In addition, the target is
high volume background noise instead of high impact alerts, so we consider the
approach to be different from other correlation efforts, such as those presented by
Valdes and Skinner [14] or Qin and Lee [15].

Manganaris et al. [16] use data mining to gain a better understanding of alert
data. Julisch and Dacier [2] take the approach further, report episode rules to be
a labor intensive approach, and develop conceptual clustering to construct filters
for false positives. Instead of filtering, we propose to monitor the levels of
background noise, provided that this significantly reduces the number of alerts
displayed to the operator.

In addition to a plethora of other applications, the EWMA model has also
been harnessed for intrusion detection. There are two relatively recent
approaches: an unnamed one from Ye et al. [9] [10], and ArQoS
(http://arqos.csc.ncsu.edu), developed by Mahadik et al. [11].

Both of these IDSes are meant mainly to detect Denial of Service (DoS)
attacks. The methods proposed by Ye et al. use a host-based data source, the
Solaris BSM audit event intensity, for attack detection. ArQoS monitors the
Quality of Service (QoS) parameters of a DiffServ network, such as bit rate,
jitter and packet drop rate, to detect attacks on QoS. The latter can thus be
said to be a network-based approach.

Ye et al. test different control charts to find out their suitability for
intrusion detection: in [9] one meant for autocorrelated and another for
uncorrelated data, and [10] adds one more for the standard deviation. Their
conclusion was that all the different charts could be used for detecting
attacks causing statistically significant changes in event intensity.

Mahadik et al. [11] rely on EWMA techniques only to analyze the more
stationary flows. For the less stationary flows they deploy a different
statistic. Their control chart differs from those used by Ye et al., and
according to their tests the overall system is capable of quickly detecting
QoS degradation. The approach we propose differs from these two in the
following ways: 1) It provides a view on system state instead of detecting
DoS attacks or intrusions more generally. 2) The audit source is the alert
database created by a network-based sensor instead of host-based or network
traffic information, so no access to host audit trails or routers is
required. 3) The control chart is defined slightly differently.

5 Conclusions and Future Work 

Activity that is not considered an actual attack but only harmful tends to
trigger huge amounts of alerts. Typically this kind of raw intelligence
output by sensors is insignificant and distracting for the operator, but the
changes in the levels of this background noise can be of interest. If these
alerts are removed by simply filtering or judging them as false by a
correlation engine, the operator will lose this information.

An alert processing method based on EWMA control charts to summarize 
the behavior of such alert flows was presented to meet the five objectives set 
in Sect. 1.3: anomaly highlighting, decreasing operator load, reduction measure- 
ment, determination of suitable flows for monitoring, and trend visualization. 

According to our experience with the technique, it can be used to highlight
anomalies in high volume alert flows showing a sufficient degree of regularity. With
this approach it is possible to make the high alert levels associated with these 
flows more sustainable without deactivating them. We believe that the method 
could be used as such, or in complement to other means of correlation, to moni- 
tor alerts considered as background noise of an operational system. The provided 
additional diagnostic capabilities may be modest, but more importantly via sum- 
marization the operator can save time for more relevant tasks as he is informed 
only of significant changes in the noise level. A metric based on the proportion of 
time units freed from manual processing when monitoring an aggregate instead 
of raw alert flow was proposed. 
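
The idea can be sketched as follows. The code below is a generic EWMA-based control band in a common textbook form, not the control chart definition used in this work; the smoothing factor, band width, warm-up length and the sample counts are all assumed values chosen for illustration. It also computes a reduction figure in the spirit of the proposed metric, here taken to be the proportion of busy intervals (those containing alerts) that are not flagged.

```python
import math


def ewma_flags(counts, smoothing=0.2, width=3.0, warmup=5):
    """Flag intervals whose alert count leaves an EWMA control band.

    Each count is compared against mean +/- width * std, where mean and
    variance are exponentially weighted estimates built from earlier
    intervals only. The first `warmup` intervals are never flagged.
    """
    if not counts:
        return []
    mean = float(counts[0])
    var = 0.0
    flags = [False]
    for i, x in enumerate(counts[1:], start=1):
        limit = width * math.sqrt(var)
        flags.append(i >= warmup and abs(x - mean) > limit)
        # Update the estimates after the comparison. Flagged values are
        # also absorbed here, which widens the band for a while.
        dev = x - mean
        mean += smoothing * dev
        var = (1 - smoothing) * (var + smoothing * dev * dev)
    return flags


def busy_interval_reduction(counts, flags):
    """Proportion of intervals with alerts that need no operator attention."""
    busy = [flag for count, flag in zip(counts, flags) if count > 0]
    if not busy:
        return 0.0
    return 1.0 - sum(busy) / len(busy)


# Hypothetical hourly alert counts for one signature.
counts = [100, 104, 98, 101, 99, 103, 97, 400, 102, 100]
flags = ewma_flags(counts)
reduction = busy_interval_reduction(counts, flags)
```

For these sample counts only the interval with 400 alerts is flagged, so nine of the ten busy intervals would require no manual processing.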

Alert flows creating fewer alerts or having strict requirements for the
timeliness of detection are better treated by other means, since the sampling
interval is coarse and the method is not able to find useful trends from a
small number of alerts.

As the applicability of the method to a particular flow is currently determined
from its visualization, the objective of determining suitable flows is not met;
meeting it requires the definition of explicit criteria and an automated
process. The generation of meaningful alert summaries for the operator also
needs to be addressed for the objective “trend visualization” to be met.

For the moment, only the signatures and signature classes have been used as
the aggregation criterion. The use of source and destination hosts or networks
could be a step towards more specific flows, if required. We also intend to
investigate different similarity measures for forming the monitored
aggregates. Similar packet payload, for example, is one such possibility.

Gathering user experience from the operators would be interesting, and for
this the method is being integrated as part of the alert console used
internally at France Telecom.

Abstract. This paper provides an examination of an emerging class of security
mechanisms often referred to as deception technologies or honeypots. It is
based on our experience over the last four years designing and building a
high-end commercial deception system called ManTrap. The paper will provide an
overview of the various technologies and techniques and will examine the 
strengths and weaknesses of each approach. It will discuss deployment criteria 
and strategies and will provide a summary of our experiences designing and 
constructing these systems. It also presents the results of work demonstrating 
the feasibility and utility of a deep deception honeypot. 

Keywords: Deception, Honeypot, ManTrap. 



1 Introduction 

Over the past several years network systems have grown considerably in size, com- 
plexity, and susceptibility to attack. At the same time, the knowledge, tools, and tech- 
niques available to attackers have grown just as fast if not faster. Unfortunately defen- 
sive techniques have not grown as quickly. The current technologies are reaching 
their limitations and innovative solutions are required to deal with current and future 
classes of threats. Firewalls and intrusion detection/protection systems are valuable 
components of a security solution but they are limited in the information they can 
provide. While they can provide very broad protection for a large variety of services, 
they also provide very shallow protection. Even the solutions with the most complete 
“application protection” or “deep inspection” can determine very little about host-side 
effects, attacker capability, or attacker intent. In order to provide scalable detection 
for a wide variety of applications and systems, they cannot support the necessary 
environment in which to completely evaluate a potential threat. High bandwidth and 
network encryption also present barriers to such solutions. While host monitoring 
solutions (e.g. host intrusion detection systems, a.k.a. HIDS) do not suffer from all of 
these limitations, they are encumbered with their own. Using such host solutions 
poses significant management and scalability challenges and also places real assets at 
risk. Deception systems, also called “honeypots”, present a valuable combination of
these two approaches. 

This paper presents the results of our experience designing and deploying honey- 
pots for the last four years. It will first provide a basic overview of honeypot technol- 
ogy, including a classification system that we have developed and used. It will then 
examine deployment techniques and strategies we have used and observed. Finally, it 
will discuss our specific experiences designing and constructing our honeypot, called
ManTrap [1]. It will provide a detailed look at some of the design challenges and 
existing problems yet to be solved. 



2 Honeypot Basics 

A honeypot appears to be an attractive target to an attacker. These targets can be real 
systems or some type of emulator designed to appear as servers, desktops, network 
devices, etc. When the attacker attempts to attack the network, they either stumble 
into or are led into the honeypot. The honeypot then records all of the attacker’s ac- 
tions as they assess and attempt to compromise it. Depending on the specific class of 
honeypot it may provide additional functionality such as automated alerting, triggered 
responses, data analysis, and summary reporting. 

Using a honeypot has numerous advantages. First, it wastes the attacker’s time. Any
time spent attacking a honeypot is time not spent attacking a real machine. Second, it
provides extremely detailed information about what the attacker does and how they
do it. Third, it gives the attacker a false impression of the existing security measures.
Thus the attacker spends time finding tools to exploit the honeypot that may not work
on a real system. Fourth, the existence of a honeypot decreases the likelihood that a
random attack or probe will hit a real machine. Finally, a honeypot has no false
positives. Any activity recorded is suspicious, as a honeypot is not used for any other
purpose.

Much like a pot of honey used to attract and trap insects, a honeypot ensnares an 
attacker by appearing to be an attractive target. Depending on the depth of the decep- 
tion an attacker can spend large amounts of time attempting to exploit and then ex- 
ploring the honeypot. Meanwhile all this activity is recorded and reported to the 
honeypot owner. The more time the attacker spends with the honeypot the more in- 
formation about her means and motives is given to the owner. This information can be 
used to make other machines immune to the tools being used. 

If an attacker does not know the weaknesses of a system he cannot exploit it. 
Honeypots give attackers a false sense of accomplishment. They spend time research- 
ing the vulnerabilities presented by the honeypot. They create or find tools to exploit 
those vulnerabilities. Finally, they spend time executing these exploits and demon- 
strating to the honeypot owner exactly how to thwart their attack should it be viable 
on other machines. 

Many attackers scan large blocks of computers looking for victims. Even attackers 
targeting a specific organization will scan the publicly accessible machines owned by 
the organization looking for a machine to compromise as a starting point. Using 
honeypots decreases the chance an attacker will choose a valuable machine as a tar- 
get. A honeypot will detect and record the initial scan as well as any subsequent at- 
tack. 

Unlike other intrusion detection measures there are no false positives with a 
honeypot. All IDS systems produce false positives to varying degrees. This is because 
there is always a chance that valid traffic will match the characteristics the IDS uses 
to detect attacks. This is not the case with a honeypot. Any communication with a 
honeypot is suspect. This is because the honeypot is not used for any purpose other 
than detecting attacks. There is no valid traffic to produce false positives. 
In this way a honeypot can detect more attacks than other IDS measures. New
vulnerabilities can be found and analyzed because all actions an attacker takes are
recorded. New attack tools can be detected based on their interaction with a honeypot.
Since all communication is suspect, even new or unknown attacks which exhibit no 
signature or anomalous characteristics can be detected. These can include feeding 
false information into a service or database, using compromised credentials to gain 
unauthorized access, or exploiting some new application logic flaw. Finally, a honey- 
pot can detect and record incidents that may last for months. These so-called ‘slow 
scans’ are difficult to detect using an IDS as the time involved makes them very diffi- 
cult to differentiate from normal traffic without being false positive prone. 



3 Classification of Honeypots 

Honeypots are not a new idea. Researchers and security professionals have been using 
different forms of honeypots for many years [8] [9] [10]. In recent years however, there 
has been rapid innovation in the technology and significant increases in deployment. 
As honeypots become more mainstream, it is useful to discuss them in a slightly more 
formal sense. 

Honeypots can be classified into three primary categories: facades, sacrificial 
lambs, and instrumented systems. A facade is the most lightweight form of a honey- 
pot and usually consists of some type of simulation of an application service in order 
to provide the illusion of a victim system. A sacrificial lamb usually consists of an
“off the shelf” or “stock” system placed in a vulnerable location and left as a victim.
An instrumented system honeypot is a stock system with additional modification to 
provide more information, containment, or control. 

Each class of honeypots has different strengths and weaknesses and is appropriate 
to different types of use according to these. The sections below explore each class 
with respect to implementation, strengths and weaknesses and typical uses. 

Note that while these classifications are primarily our creation, we have been using
them with others in the field for a number of years. Other classification systems do
exist [12]; however, ours attempts to provide more information about the honeypot
(form, capability, risk, etc.) rather than just the degree of interaction.

3.1 Facades 

A facade honeypot is a system which provides a false image of a target host. It is most 
often implemented as software emulation of a target service or application. This emu- 
lation acts like a vulnerable host or service. Some implementations can emulate large 
numbers of hosts, varieties of operating systems, and different applications or ser- 
vices. When the facade is probed or attacked, it gathers information about the attacker 
and provides a fictitious response. This is analogous to having a locked door with 
nothing behind it and watching to see who attempts to open it. The depth of the simu- 
lation varies depending on implementation. Some will provide only partial application 
level behavior (e.g. banner presentation). Other implementations will actually simu- 
late the target service down as far as the network stack behavior. This is done in order 
to prevent remote signaturing by OS fingerprinting. The value of a facade honeypot
is defined primarily by what systems and applications it can simulate and how easy it
is to deploy and administer.
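
To make the shallowest form of facade concrete, the following minimal sketch presents a banner, records whatever the client sends first, and closes the connection. It is an illustration only and does not describe NetFacade, Honeyd or any other product; the banner text, port number, log format and function names are all assumptions.

```python
import socket
from datetime import datetime, timezone

# Hypothetical banner imitating a mail service.
BANNER = b"220 mail.example.com ESMTP ready\r\n"


def handle_probe(conn, peer, log, max_bytes=1024, timeout=5.0):
    """Present the banner, capture the first bytes the client sends, log them."""
    conn.settimeout(timeout)
    try:
        conn.sendall(BANNER)
        data = conn.recv(max_bytes)
    except OSError:
        # Covers timeouts and connection resets; the probe is still logged.
        data = b""
    finally:
        conn.close()
    log.append({"time": datetime.now(timezone.utc), "peer": peer, "data": data})


def serve(host="127.0.0.1", port=2525, log=None):
    """Accept connections forever; every connection is recorded as suspect."""
    log = [] if log is None else log
    with socket.create_server((host, port)) as server:
        while True:
            conn, peer = server.accept()
            handle_probe(conn, peer, log)
```

Because the service behind the banner does not exist, every entry in the log is of interest. A deeper facade would continue the protocol dialogue and imitate the network stack behavior of the simulated system, which this sketch does not attempt.
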

Facades offer simple, easy deployment as they often have very minimal installation
or equipment requirements and are easy to administer. They can provide a large
number of targets of considerable variety. Since they are not real systems, they do not
have the vulnerabilities of real systems. They also present very little additional risk to
your environment due to the nature of the emulation. While the system underneath is
“real”, the emulated services are not. They cannot be compromised in the same
fashion as the “live” services they emulate. Thus the honeypot cannot be used as a
jumping off point. While it is technically possible that someone could attempt to
actually exploit the emulated service (knowing that it is a honeypot), this seems very
unlikely. At worst it simply merits caution in deployment.

Their only significant limitation is that due to their limited depth, they provide only 
basic information about a potential threat. They may also fail to engage the attacker 
for long periods of time since there is not anything to compromise. This lack of depth 
can potentially create a signature which drives the attacker away from the honeypot. 
While this can be considered a limitation, by the time the attacker becomes suspi- 
cious, they have usually interacted with the honeypot enough to generate alerts, pro- 
vide intelligence, etc. 

Examples of this type of honeypot include NetFacade and Honeyd [2]. 

Sites that wish to deploy very simple deception as a form of early warning system 
should consider facade products given their simplicity to deploy and low administra- 
tive overhead. These are typically used by small to medium enterprises or by large 
enterprises in conjunction with other technology. While very little hard data exists to 
indicate the exact scale of this, our field experience supports this conclusion. 



3.2 Sacrificial Lambs 

A sacrificial lamb is a normal system left vulnerable to attack. It can be built from
virtually any device (a Linux server, a Cisco router, etc.). The typical implementation
involves loading the operating system, configuring some applications and then leav- 
ing it on the network to see what happens. The administrator will examine the system 
periodically to determine if it has been compromised and if so what was done to it. In 
many cases, the only form of data collection used is a network sniffer deployed near 
the honeypot. While this provides a detailed trace of commands sent to the honeypot, 
it does not provide any data in terms of host effects. In other cases additional exami- 
nation is done either by hand or using various third-party forensic tools. Also the 
systems themselves are “live” and thus present a possible jumping off point for an 
attacker. Additional deployment considerations must be made to isolate and control 
the honeypot by means of firewalls or other network control devices. 

Sacrificial lambs provide real targets. All the results are exactly as they would be 
on a real system and there is no signature possible since there is nothing different 
about the system. These types of honeypots are also fairly simple to build locally 
since they only use off-the-shelf components. Sacrificial lambs provide a means to 
analyze a compromised system down to the last byte with no possible variation. How- 
ever, this type of honeypot requires considerable administrative overhead. The instal- 
lation and setup requires the administrator to load the operating system themselves 
and manually perform any application configuration or system hardening. The analy- 
sis is manual and often requires numerous third-party tools. They also do not provide 
integrated containment or control facilities, so will require additional network
considerations (as mentioned above) to deploy in most environments.

There are no specific examples of sacrificial lambs since they can be constructed 
from virtually anything. However the Honeynet Project [3] provides good examples 
on constructing these. 

Groups or individuals that are interested in doing vulnerability research should 
consider a sacrificial lamb honeypot. It will require dedicated expert security re- 
sources to support but will provide a great deal of information and flexibility. 

3.3 Instrumented Systems 

Instrumented systems provide a compromise between the low cost of a facade and the 
depth of detail of a sacrificial lamb. They are implemented by modifying a stock sys- 
tem to provide additional data collection, containment, control and administration. 

Designed as an evolutionary step from earlier forms of deception, they provide 
easy to deploy and administer honeypots that are built on real systems. They are able 
to provide an exceptional level of detail (often more than a sacrificial lamb) while 
also providing integrated containment and control mechanisms. There are two
important considerations when using instrumented systems. The first is that building
one can be very expensive and difficult to do correctly. It requires significant time,
skill and knowledge to create even a moderately good deception which is neither
detectable (e.g. through a signature) nor itself a security risk. Some administrators
attempt to construct their own
but often run into difficulty creating an effective deception, providing effective isola- 
tion, and providing sufficient management functionality. Sites interested in instru- 
mented systems should consider one designed by a security professional with signifi- 
cant honeypot experience and which is provided as a real software product (including 
support). 

An example of this type of honeypot would be Symantec’s ManTrap product. 

Sites interested in receiving more information than a facade provides but that can- 
not afford the large administrative overhead of a sacrificial lamb system should con- 
sider an instrumented system honeypot. These provide a richer integrated feature set 
and have taken into consideration scalability, deployment, reporting, and
administration. These are typically used by medium to large enterprises.

3.4 Additional Considerations 

While not specific to a particular class or form of honeypot, there are a number of 
additional features or functions which should be considered by an organization evalu- 
ating honeypots. 

It is important to consider the nature and the cost of containment and control. Any 
system deployed in a network presents possible risk. Measures should be taken to 
mitigate that. Risk level, functionality, and restriction capability should be considered 
in any product that provides containment and control. If the product does not support 
any native containment and control, the cost and complexity of implementing it 
should be seriously considered. 

While honeypots can provide an excellent source of data, it is important to remem- 
ber that the data by itself does nothing. In order to be useful, the data must be ana- 
lyzed. Some products provide integrated analysis, reporting and alerting. Others re- 
quire the administrator to provide the data review and security expertise. How much 
analysis is offered and how the administration is done is an important consideration
and has significant impact on the cost of using such a system. 

Cluster or group administration functionality should be considered when deploying 
multiple deception devices. Systems which provide the ability to work in clusters and 
have single points of administration and reporting provide for a much more scalable 
solution than those that require manual operation of each node. 

Maintenance of content and restoration of the honeypot should also be taken into 
consideration. These both contribute to the ongoing administrative cost of maintain- 
ing a deception system. Content on a deception device needs to be periodically up- 
dated so it appears valid and “live”. Deception systems which have been attacked may 
also need to be periodically restored to a “clean” state. In both of these cases,
solutions which provide automated capabilities for this can reduce administrative costs.

Finally, it is worth considering the relationship of honeypots to host-based intru- 
sion detection systems (HIDS) [4] and integrity monitoring systems. HIDS are usually 
deployed on a production system and designed more as a burglar alarm. Running 
these on a production system really does not provide the same value as a honeypot. 
They are much more prone to false positives, force the administrator to deal with the 
difficulty of monitoring normal user activity, and generally do not provide contain- 
ment or good administration functionality (for a honeypot approach). These can be 
used to create honeypots, but often produce very large signatures since they are not 
designed for stealth. 

Integrity monitoring software has many of the same deficiencies as HIDS for 
honeypot use. It is designed for monitoring a production system for change, not user 
activity or security. It provides none of the additional functionality needed for a 
honeypot. As with a HIDS, these also create very large signatures (indications that 
this is not a normal system) that are not desirable for a honeypot. 



4 Deployment Strategies 

While many honeypot implementations may function well in single deployments with 
dedicated administrative efforts, larger deployments (a.k.a. "enterprise deployments") 
require additional functionality to be effective solutions. An organization that wishes 
to deploy honeypots should have an overall computer security policy that states what 
the threats are, what the main goals for an attacker might be, where high-value systems
are, and how potential targets will be protected. This security policy will dictate what 
the honeypot deployment strategy will be. 

This section describes a few different deployment strategies. These strategies, or 
combinations of them, can be used together with firewalls and IDS to form a cohesive 
security infrastructure to protect an organization. 

4.1 Minefield 

In a minefield deployment, honeypots are installed among live machines, possibly 
mirroring some of the real data. The honeypots are placed among external servers in 
the DMZ, to capture attacks against the public servers, and/or in the internal network, 
to capture internal attacks (either attacks that originated internally or external attacks
that penetrated the firewall and now use internal machines as launching pads).

Attacks are rarely restricted to a single machine. Many manual and automated
network attacks follow the same pattern: Assuming a successful attack has taken place
on one machine in the network, that machine is then used to scan the network for 
other potential targets, which are subsequently attacked. For manual attacks, this takes 
some time, while worms will normally execute the scan just seconds after the first 
infection [11]. The scanning can be done in a way to specifically avoid setting off IDS 
systems (e.g., through "slow scans"), but honeypots in a minefield will be alerted. 

For example, if a network has one honeypot for every four servers, then the chance 
of hitting a honeypot with a random, single-point attack is 20%. In reality, the chance 
is significantly better than that because in most cases an entire block of network ad- 
dresses will be scanned. When this happens, it is practically guaranteed that the 
honeypot will detect the intrusion shortly after any machine on the network has been 
compromised. 
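
The arithmetic behind this example can be worked through under a simple model, which is our assumption and not a measured result: the attacker probes distinct addresses chosen uniformly at random among the honeypots and servers. With H honeypots, S servers and k probes, the chance of touching at least one honeypot is then 1 - C(S, k) / C(H + S, k). The function name below is hypothetical.

```python
from math import comb


def detection_probability(honeypots, servers, probes):
    """Chance that at least one of `probes` distinct random targets is a honeypot."""
    total = honeypots + servers
    probes = min(probes, total)  # cannot probe more addresses than exist
    # comb(servers, probes) is 0 once probes exceeds the number of servers,
    # which makes detection certain.
    return 1.0 - comb(servers, probes) / comb(total, probes)


# One honeypot for every four servers, as in the example above.
single_probe = detection_probability(1, 4, 1)
two_probes = detection_probability(1, 4, 2)
full_scan = detection_probability(1, 4, 5)
```

Under this model a single probe gives 1 - 4/5 = 20%, two probes give 1 - 6/10 = 40%, and a scan of the whole block gives 100%, which is why block scans make detection practically certain.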

Even though the intrusion detection aspect alone is important, another feature of 
using honeypots is to gain information on attack tools and purpose. With good security
practices on the production machines, weaker security on the honeypots may increase the
chance that they will be the first machines that are attacked. A well-designed honey- 
pot will then have the information about what service was attacked, how that service 
was attacked, and - if the attack was successful - what the intruder did once inside. 
Having the honeypots configured exactly the same way as the regular servers, how- 
ever, has other advantages. It increases their deception value slightly, and it also 
means that when a honeypot has detected a successful attack, that attack is likely to 
succeed on the production hosts. 




Fig. 1. A “minefield” deployment 



4.2 Shield 

In a shield deployment, each honeypot is paired with a server it is protecting. While 
regular traffic to and from the server is not affected, any suspicious traffic destined for 
the server is instead handled by the honeypot shield. This strategy requires that a 
firewall/router filters the network traffic based on destination port numbers, and redi- 
rects the traffic according to the shielding policy. 

For instance, consider a web server deployed behind a firewall. Web server traffic 
will be directed to the web server IP address on TCP port 80. Any other traffic to the 
web server is considered suspicious, and can be directed to a honeypot. 
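
The shielding policy for the web server example can be sketched as a simple lookup on protocol and destination port. This models only the decision itself, not an actual firewall or router configuration; the addresses (taken from documentation ranges), the `Shield` type and the `route` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shield:
    server: str                # address of the protected server
    honeypot: str              # address of its paired honeypot
    allowed_ports: frozenset   # (protocol, port) pairs that reach the server


def route(shield, protocol, dst_port):
    """Return the address that should receive traffic addressed to the server."""
    if (protocol, dst_port) in shield.allowed_ports:
        return shield.server
    # Anything else addressed to the server is treated as suspicious.
    return shield.honeypot


web_shield = Shield(
    server="192.0.2.10",
    honeypot="192.0.2.110",
    allowed_ports=frozenset({("tcp", 80)}),
)
```

Here `route(web_shield, "tcp", 80)` selects the web server, while a connection to any other port, such as `route(web_shield, "tcp", 22)`, selects the honeypot.
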

The honeypot should be deployed in a DMZ, and to maximize the deception value,
it may replicate some or all of the non-confidential content of the server it is
shielding. In the example of the web server, this is merely a matter of mirroring some
or all of the web content to the honeypot.

In conjunction with the firewall or router, honeypots deployed in this fashion pro- 
vide actual intrusion prevention in addition to intrusion detection. Not only can poten- 
tial attacks be detected, they can be prevented by having the honeypot respond in 
place of the actual target of the attack. It should be added that a honeypot shield 
cannot protect a mail server from SMTP exploits, nor a web server from HTTP ex- 
ploits, since "regular" traffic must be able to reach its target. However, since live 
servers generally need very few open ports, it is reasonably easy to find the point of 
an attack - both for prevention and forensic purposes - and all other ports lead 
straight to the honeypot, where the attack can be analyzed in detail. 

A shield deployment is an example of how honeypots can protect a high-value sys- 
tem, where attacks can be expected. 




Fig. 2. A “shield” deployment 



4.3 Honeynet 

In a honeynet deployment, a network of honeypots imitates an actual or fictitious 
network. From an attacker’s point of view, the honeynet appears to have both servers 
and desktop machines, many different types of applications, and several different 
platforms. Another term for this deployment is “zoo”, as it captures the wild hacker in 
their natural environment. 

In a sense, a honeynet is an extension of the honeypot concept, in that it takes mul- 
tiple deception hosts (single honeypots), and turns them into an entire deception net- 
work. A typical honeynet may consist of many facades (because they are light-weight 
and reasonably easy to deploy), some instrumented systems for deep deception, and 
possibly some sacrificial lambs. In order to provide a reasonably realistic network 
environment, some sort of content generation is necessary. On a host basis, this in- 
volves simulating activity on each deep honeypot, as well as generating network traf- 
fic to and from the clients and servers, so that the network itself looks realistic from 
the outside. 

In a small example of a DMZ that contains a web server and a mail server,
consider two honeypots that act as shields to the servers. Any traffic to the web server
that is not HTTP traffic will be directed to the web server’s shield. Any traffic to the 
mail server that is not SMTP will be directed to the mail server’s shield. By adding a 
few more honeypots, another dimension can be added to this deception; all traffic to 
unknown IP addresses can be directed to honeypots, not only traffic to known hosts. 
The strength of the honeynet shield is that it shields an entire network instead of a 
single host. Similarly, honeynet minefields represent the scenario where each mine is 
an entire network, as opposed to just a single honeypot. It is also possible to configure 
a honeypot so that any outbound traffic (e.g. the attacker trying to attack another sys- 
tem from the honeypot) can be directed only into an isolated honeynet. This provides 
both containment and the possibility of gathering additional and very useful informa- 
tion about the attacker’s attempts. 
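
The DMZ example above can be summarized as a routing decision for each new connection. The sketch below is a simplified model under our own assumptions: all addresses, the catch-all honeypot and the isolated honeynet entry point are hypothetical, only the first packet of a connection is considered, and reply traffic is not modelled.

```python
import ipaddress

# Known servers -> (paired honeypot shield, allowed (protocol, port) pairs).
SERVERS = {
    "192.0.2.10": ("192.0.2.110", {("tcp", 80)}),   # web server
    "192.0.2.25": ("192.0.2.125", {("tcp", 25)}),   # mail server
}
CATCH_ALL = "192.0.2.200"       # honeypot answering for unknown addresses
HONEYPOTS = {"192.0.2.110", "192.0.2.125", CATCH_ALL}
DMZ = ipaddress.ip_network("192.0.2.0/24")
ISOLATED_HONEYNET = "198.51.100.1"  # entry point of an isolated honeynet


def route_connection(src, dst, protocol, dst_port):
    """Decide where a new connection attempt should be delivered."""
    if src in HONEYPOTS:
        # Containment: attacks launched from a honeypot stay in the honeynet.
        return ISOLATED_HONEYNET
    if dst in SERVERS:
        shield, allowed = SERVERS[dst]
        return dst if (protocol, dst_port) in allowed else shield
    if ipaddress.ip_address(dst) in DMZ:
        # Traffic to a honeypot reaches it; unknown addresses get the catch-all.
        return dst if dst in HONEYPOTS else CATCH_ALL
    return dst
```

In this model HTTP to the web server is delivered normally, HTTP to the mail server lands on the mail server's shield, a probe of an unused DMZ address lands on the catch-all honeypot, and any connection originating from a honeypot is confined to the isolated honeynet.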

Honeynets can be useful in a large enterprise environment, and offer a good early 
warning system for attacks. A honeynet may also provide an excellent way to figure 
out an intruder’s intention, by looking at what kind of machines and services are at- 
tacked, and what is done to them. The Honeynet Project (http://project.honeynet.org) 
is an excellent example of a honeynet used as a research tool to gather information 
about attacks on computer infrastructure. 




Fig. 3. A “honeynet” deployment 



5 Experiences Constructing and Deploying an Instrumented System

ManTrap is a commercial honeypot product in the category of "instrumented sys- 
tems". It was originally developed by Recourse Technologies and is now a Symantec 
product. The remainder of this paper discusses our experience with ManTrap. We will 
first present a brief overview of its design and functionality and then discuss some of 
the challenges we faced in constructing and deploying it. Finally we will present a 
number of existing problems that have not yet been solved. 

We believe that these types of instrumented systems provide a useful, deployable 
tool for many organizations interested in using honeypots. Many of the design consid- 
erations made were intended to create a honeypot which was simple enough for most 
administrators to use, secure enough to deploy, and still deep enough to gather valuable
information about potential attacks. Our goal was to provide a professional-quality,
high-interaction honeypot usable by a broad audience.

5.1 ManTrap Goals

ManTrap was designed to be a commercially usable honeypot. While there are many 
ways to implement and deploy honeypots, most require far too much administrative 
overhead, far too much technical expertise, or create far too much risk to be deployed 
in most commercial environments. ManTrap’s goal was to create a honeypot which
could be easily deployed and maintained by a standard enterprise IT/security staff and 
provide valuable security data which could not be easily obtained from other existing 
tools. 



5.2 A Brief Overview 
High Level Architecture 

A ManTrap system consists of a single physical computer. ManTrap is installed on 
top of the operating system (Solaris) and provides operating system level virtualiza- 
tion of the system to implement its “honeypots”. Each machine can provide up to four 
different honeypots - or "cages" - with each cage being completely isolated from the 
other cages as well as from the real host system. A user logged into a cage will not be 
able to see the processes, network traffic, and other system resources of the other 
cages, nor of the host system itself. To the attacker, each cage appears to be a separate 
machine. If a system file is deleted in one cage, it will still exist in the others. 

If an attacker obtains access to a cage, whether by a stolen password, remote net- 
work exploit, or other means, the cage will provide a controlled environment where 
information is gathered about the activity, while at the same time containing the at- 
tacker, and stopping him from discovering that he is being monitored. 

ManTrap also provides a mechanism to automatically create and maintain dynamic 
content. While it is possible to initially load the system with a set of static content 
(e.g. web pages for a web server), content which changes over time provides a much 
more convincing deception to an attacker. ManTrap provides a module that automati- 
cally generates email traffic to and from some of the users on the system. This pro- 
vides an additional piece of deception, as an intruder may be fooled into thinking he is 
capturing actual email traffic. The generated email messages are instead created from 
templates provided by the ManTrap administrator. 

The ManTrap system also includes an administration console application. This ap- 
plication, built in Java, allows the user to remotely administer the ManTrap machines. 
It is possible to administer multiple ManTrap hosts from a single console. A cluster of 
ManTraps in an enterprise can therefore be managed by a single administrator. 

Audit 

ManTrap keeps extensive audit logs of activities in its cages. Since all activity in a 
cage is suspicious (because no legitimate users belong there), as much information as 
possible is logged. Examples of the activities that a running ManTrap will log: 

• All terminal input and output 

• All files opened for writing 

• All device accesses 

• All processes that are started 

• All network activity

The ManTrap logs are meant to provide an (almost) complete view of the activities
inside the cage. ManTrap also allows the administrator to cryptographically verify 
that the logs have not been tampered with (see Audit Reliability below). 

Response 

When ManTrap detects cage activity, it is capable of alerting the administrator and/or 
responding automatically. The administrator can configure a response policy includ- 
ing: 

• SMTP (E-mail) alerts 

• SNMP traps (alerts to network management software) 

• Integration with other commercial threat management solutions (e.g. NIDS) 

• Custom responses: administrator-specified scripts or binaries to be run on a particu- 
lar event 

These responses can be used, for example, to alert administrators when a cage is
accessed or to shut down a cage once the attacker has achieved a certain level of
access (e.g. gained root).

Analysis 

The log data that is collected inside a cage is used to provide different types of activ- 
ity reports. Reports can be generated on-demand or on a scheduled, regular basis, and 
cover cage activities such as: 

• File modifications 

• Successful logins to the cage 

• Responses triggered by the cage 

• Attempted connections 

• Outgoing connections 

• TCP and/or UDP port activity on the cage 

In addition, the ManTrap administration console allows a user to monitor
interactive sessions in a terminal window, either while the session is active, or after 
the fact. This gives the ManTrap administrator a unique and realistic view of what the 
intruder saw and did during the attack. 

5.3 Construction Experience 
General Technique 

As mentioned above, ManTrap is an instrumented system. It is constructed primarily 
by means of a kernel module that intercepts system calls and provides filtering and
modification. This is backed by a virtualized file system and various coordination and 
supporting administration processes. For example, if a process in a cage attempts to 
call open() to open /etc/passwd, the ManTrap module intercepts this call and redirects 
it so that the cage copy of the file is opened instead. 
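
The redirection in the open() example can be illustrated with a small user-space sketch. This is not ManTrap's kernel code: the function name `redirect_path` and the cage root `/cages/cage1` are hypothetical, and a real implementation would also have to deal with symbolic links, devices and many other interfaces.

```python
import posixpath


def redirect_path(cage_root, requested_path):
    """Map a path requested inside a cage onto the cage's own copy of the file."""
    # Normalise as an absolute path first so that ".." cannot climb above "/".
    normalised = posixpath.normpath(posixpath.join("/", requested_path))
    # Re-root the normalised path under the (hypothetical) cage directory.
    return posixpath.join(cage_root, normalised.lstrip("/"))
```

Under these assumptions, a request for /etc/passwd from a process in the first cage would be mapped to /cages/cage1/etc/passwd, and a path such as ../../etc/shadow would still resolve inside the cage.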

Isolation 

ManTrap’s foremost requirement is that the cages be isolated from the root system and
from each other. A process within the cage is not allowed to access files, directories,
or devices except those explicitly exposed to it. A process within the cage is not
allowed to interact with a process outside the cage. However, this must all be
accomplished without causing compatibility problems for applications running inside the
cage. One important feature of ManTrap was that the users (administrators) are
allowed to run existing applications inside the deception environment without any
modification. This was quite a challenge. While we were not universally successful, as
some applications require raw device support, require conflicting kernel functionality,
or present some unacceptable privilege risk, in general ManTrap is able to provide
this isolation while still maintaining compatibility.

Stealth 

One of the key requirements which differentiate ManTrap from many existing host 
virtualization techniques (VMware [5], Solaris zones [6], etc) is stealth. ManTrap 
required that processes running within the virtualized environment, the cage, not be 
able to determine that it was not the “real” system. This required that all traces of 
monitoring, virtualization and other instrumentation be hidden. It also required that all 
activity in other cages on the same host be hidden. This included local files, running 
process lists, network data and many other things. This also needed to be done without
causing compatibility problems for applications running inside the cages and
without doing anything that would tip off an attacker. As there are numerous such 
interfaces in the operating system, and many of them not well documented, this 
proved to be one of the most significant challenges. It is also one in which there exists 
an adversarial pressure. Attackers (and sometimes researchers) would actively attempt 
to find ways in which the cage could be differentiated from the real system. While
several techniques were eventually discovered, nearly all were easily addressed. The
only remaining ones were those which required root access and relied on accessing
some hard-to-emulate resources such as /dev/kmem (see Current Challenges below).

Audit Reliability 

Since one of the values of something like ManTrap is its ability to collect detailed 
data for use in analysis or potentially as evidence, the reliability of the data is very 
important. While the prior isolation requirement should provide a guarantee that an 
attacker inside the cage not be able to access or influence the audit trail, ManTrap was 
designed with an additional integrity control in its audit system. ManTraps are de- 
ployed with a hardware crypto-token called an iButton [7]. One of the tasks the sys- 
tem uses the iButton for is log signing. Periodically, ManTrap will sign its log files 
using functionality embedded in the token. If an attacker later succeeds in accessing 
the log files, any modifications they make can be easily detected since the signature 
validation will later fail. At best, such an attacker could delete the logs or portions of 
them. 
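
The tamper-detection property of periodic log signing can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say which algorithm the token uses, so HMAC-SHA256 with a software key stands in for the signing function embedded in the iButton, and the names `sign_log` and `verify_log` are hypothetical.

```python
import hashlib
import hmac


def sign_log(log_bytes, key):
    """Return a signature over the current log contents (HMAC-SHA256 stand-in)."""
    return hmac.new(key, log_bytes, hashlib.sha256).hexdigest()


def verify_log(log_bytes, key, signature):
    """Return True only if the log still matches the signature taken earlier."""
    return hmac.compare_digest(sign_log(log_bytes, key), signature)
```

In this sketch the key lives in software; in the deployment described above, the signing functionality resides in the hardware token, so an attacker who reaches the log files cannot produce a valid signature for a modified log.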

Cage Restoration 

One of the key features added to later versions of ManTrap was the ability to easily
restore a pristine cage image. A problem encountered with early versions of ManTrap
(and other honeypots) is that once an attacker has “compromised” the honeypot and
made modifications, the cage is tainted. While it may be useful to maintain it in a
tainted state for some period of time (so an attacker can return to what they believe is
a compromised system), eventually the administrator may wish to restore the system
to a clean state and begin again. This would allow them, for example, to clearly
differentiate between what one attacker did and what subsequent attackers may do. It is a
very difficult task for an administrator to “undo” modifications made by an attacker,
even assuming a sufficient audit trail exists to reliably perform this task. While it is
always possible to completely reload the system and perform all customization and
configuration again, these are very time-consuming tasks. ManTrap added functionality
to allow administrators to easily restore configurations post installation and
customization. Thus restoring to a clean but configured and customized state is mostly a
matter of clicking a button.

Automated Analysis 

Since ManTrap is intended to be used by administrators with limited security and 
systems expertise, it attempts to provide some level of automated analysis of the data 
it collects. In some cases this is merely presentation or basic aggregation of lower 
level data. In other cases it is application of a basic knowledge of security impact of 
common events. In the former case, ManTrap is able to reconstruct data from keystroke
traces into a session view of the attacker’s “terminal” for easy observation. In
the latter case, it is able to make the determination that a root shell has been created
from a non-root shell (without explicitly authenticating) and that it may possibly
indicate use of a local privilege escalation exploit. While this is still a long way from
providing an “expert in a box”, it does succeed in lowering the amount of expertise
required for use. Improvement in this area is discussed below.
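
The root-shell determination described above might be approximated by a rule over process-creation events. The sketch below is hypothetical: the `ProcEvent` record and its `authenticated` flag are assumptions made for illustration and do not reflect ManTrap's actual audit format.

```python
from dataclasses import dataclass


@dataclass
class ProcEvent:
    pid: int
    ppid: int
    uid: int
    authenticated: bool = False  # assumed flag: True if started via explicit authentication


def suspicious_root_shells(events):
    """Return pids of root processes spawned by non-root parents without authentication."""
    uid_of = {}    # pid -> uid of processes seen so far
    flagged = []
    for ev in events:
        parent_uid = uid_of.get(ev.ppid)
        # A root child of a known non-root parent, with no authentication step,
        # may indicate a local privilege escalation exploit.
        if ev.uid == 0 and parent_uid not in (None, 0) and not ev.authenticated:
            flagged.append(ev.pid)
        uid_of[ev.pid] = ev.uid
    return flagged
```

A rule like this only raises a possibility for the administrator to review; it does not by itself establish that an exploit was used.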

5.4 Current Challenges 

While we consider the ManTrap product a great success, there are still a number of 
open problems or challenges to be addressed to fully realize our original goals. We 
discuss the most significant of these below.

Once an attacker has succeeded in obtaining root access, even emulated, it be- 
comes difficult to maintain some portions of our functionality; most notably stealth. 
While it is possible to prevent the “root” process in the cage from accessing external 
resources, in some cases this presents a significant signature. For example, consider 
the situation in which a root process attempts to access /dev/kmem directly. If the 
system disallows the access it presents a property which can be used as a signature. If 
access is allowed the system must virtualize this resource. Allowing access (e.g. via a 
pass-thru to the real /dev/kmem) would allow an intruder to see and possibly modify 
anything in memory, even things outside the cage. Unfortunately virtualizing some 
resources, like kernel memory, is quite difficult (maybe impossible) and not some- 
thing we have accomplished yet. 

Another difficulty we encountered in developing ManTrap is that, due to its design, 
it has a very high porting cost. Since many of the modifications performed to instru- 
ment the system are done using very platform specific interfaces and must emulate 
functionality which is very specific to a particular operating system, any port is al- 
most a complete rewrite. While administrative components and general design can be 
reused, much of the hard work (and the research necessary to design it) must be done 
for each operating system supported. Additionally some operating systems (e.g. Win- 
dows) differ enough in their basic architecture that considerable redesign must be 
done. 

One of the original goals was to reduce the expertise required to operate a honeypot
to increase the size of the potential user base. While we think the functionality
provided in ManTrap makes great progress in this area, there is still room for
improvement. While basic maintenance tasks are well automated and data presentation
is easy to use, the system cannot perform much automated analysis. There would be 
considerable value in a system which could automatically assess attacker intent and 
skill level. Functionality which could automatically assess the nature, risk, and pur- 
pose of new files transferred onto the system (e.g. exploit kits) would also be very 
valuable. Automated analysis in general is a large and open area for computer security 
research, but there are a number of very honeypot specific tasks in which we envision 
future progress. 



6 Summary and Conclusions 

Our experience developing ManTrap validated our initial concept that it was possible 
to build such a deep instrumented system honeypot. It was possible by modifying the 
operating system using existing access points to provide for the needed isolation, 
stealth, and audit functionality. It was also possible to automate enough of the admin- 
istrative tasks to create a tool that was usable without considerable honeypot exper- 
tise. Our practical experience with the users revealed that most administrators capable 
of administering a Solaris system were also capable of administering a ManTrap. We 
did however discover that in many environments where it was desirable to deploy 
honeypots, even that level of expertise did not exist. We conclude that while we met 
our original design goals, this suggests there is a need to further reduce the adminis- 
trative complexity. 

Through numerous incidents, these honeypots proved to be valuable complements
to existing security infrastructure. They were able to detect attacks earlier than other 
systems, detect attacks other systems did not, and provide an extremely high level of 
data about the attackers, their methods and intent. We conclude that deception tech- 
nologies or honeypots are an important, emerging security technology. They provide 
the defender with both the time and information needed to effectively respond to a 
wide variety of threats. They are cost effective to deploy and administer and are capa- 
ble of detecting threats other detection technologies cannot. They provide a powerful 
defense mechanism that should be a component of any security solution.



Abstract. We present a payload-based anomaly detector, which we call PAYL, for
intrusion detection. PAYL models the normal application payload of network
traffic in a fully automatic, unsupervised and very efficient fashion. During a
training phase, we first compute a profile of the byte frequency distribution, and
its standard deviation, for the application payload flowing to a single host and port.
We then use Mahalanobis distance during the detection phase to calculate the
similarity of new data against the pre-computed profile. The detector compares
this measure against a threshold and generates an alert when the distance of the
new input exceeds this threshold. We demonstrate the surprising effectiveness
of the method on the 1999 DARPA IDS dataset and a live dataset we collected
on the Columbia CS department network. In one case nearly 100% accuracy is
achieved with 0.1% false positive rate for port 80 traffic.



1 Introduction 

There are many IDS systems available that are primarily signature-based detectors. 
Although these are effective at detecting known intrusion attempts and exploits, they 
fail to recognize new attacks and carefully crafted variants of old exploits. A new 
generation of systems is now appearing based upon anomaly detection. Anomaly 
Detection systems model normal or expected behavior in a system, and detect devia- 
tions of interest that may indicate a security breach or an attempted attack. 

Some attacks exploit the vulnerabilities of a protocol; other attacks seek to survey
a site by scanning and probing. These attacks can often be detected by analyzing the 
network packet headers, or monitoring the network traffic connection attempts and 
session behavior. Other attacks, such as worms, involve the delivery of bad payload 
(in an otherwise normal connection) to a vulnerable service or application. These may 
be detected by inspecting the packet payload (or the ill-effects of the worm payload 
execution on the server when it is too late after successful penetration). State of the 
art systems designed to detect and defend systems from these malicious and intrusive 
events depend upon “signatures” or “thumbprints” that are developed by human experts
or by semi-automated means from known prior bad worms or viruses. They do
not solve the “zero-day” worm problem, however: the first occurrence of a newly
unleashed worm or exploit.



E. Jonsson et al. (Eds.): RAID 2004, LNCS 3224, pp. 203-222, 2004. 
© Springer-Verlag Berlin Heidelberg 2004 




204 Ke Wang and Salvatore J. Stolfo 



Systems are protected only after a worm has been detected, and a signature has been
developed and distributed to signature-based detectors, such as a virus scanner or a
firewall rule. Many well known examples of worms have been described that propagate
at very high speeds on the internet. These are easy to notice by analyzing the rate
of scanning and probing from external sources, which would indicate that a worm
propagation is underway. Unfortunately, although this approach detects the early onset of
a propagation, by that time the worm has already successfully penetrated a number of
victims, infected them, and started its damage and its propagation. (It should be evident
that slow and stealthy worm propagations may go unnoticed if one depends entirely on the
detection of rapid or bursty changes in flows or probes.)

Our work aims to detect the first occurrences of a worm either at a network system 
gateway or within an internal network from a rogue device and to prevent its propa- 
gation. Although we cast the payload anomaly detection problem in terms of worms, 
the method is useful for a wide range of exploit attempts against many if not all ser- 
vices and ports. 

In this paper, the method we propose is based upon analyzing and modeling nor- 
mal payloads that are expected to be delivered to the network service or application. 
These normal payloads are specific to the site in which the detector is placed. The 
system first learns a model or profile of the expected payload delivered to a service 
during normal operation of a system. Each payload is analyzed to produce a byte 
frequency distribution of those payloads, which serves as a model for normal pay- 
loads. After this centroid model is computed during the learning phase, an anomaly 
detection phase begins. The anomaly detector captures incoming payloads and tests 
the payload for its consistency (or distance) from the centroid model. This is accom- 
plished by comparing two statistical distributions. The distance metric used is the 
Mahalanobis distance metric, here applied to a finite discrete histogram of byte value 
(or character) frequencies computed in the training phase. Any new test payload 
found to be too distant from the normal expected payload is deemed anomalous and 
an alert is generated. The alert may then be correlated with other sensor data and a 
decision process may respond with several possible actions. Depending upon the
security policy of the protected site, one may filter, reroute or otherwise trap the
network connection so that the poison payload is never delivered to the
service/application, thereby avoiding a worm infestation.
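
The detection step outlined above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes a simplified form of the Mahalanobis distance that treats byte values independently (ignoring covariance), and the smoothing factor `alpha`, the threshold value and the function names are all assumptions introduced here.

```python
def byte_frequencies(payload):
    """Relative frequency of each of the 256 byte values in a payload."""
    counts = [0] * 256
    for b in payload:
        counts[b] += 1
    total = len(payload)
    return [c / total for c in counts] if total else [0.0] * 256


def simplified_distance(freqs, mean, std, alpha=0.001):
    """Sum of per-byte deviations from the centroid, scaled by (std + alpha).

    alpha is an assumed smoothing factor that avoids division by zero when a
    byte value showed no variation during training.
    """
    return sum(abs(x - m) / (s + alpha) for x, m, s in zip(freqs, mean, std))


def is_anomalous(payload, mean, std, threshold, alpha=0.001):
    """Flag a payload whose distance from the centroid exceeds the threshold."""
    return simplified_distance(byte_frequencies(payload), mean, std, alpha) > threshold
```

For instance, with a centroid trained only on payloads consisting of the letter "a", a payload of 0x90 bytes lies far from the model and would be flagged for any small threshold, while another all-"a" payload has distance zero.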

There are numerous engineering choices possible to implement the technique in a 
system and to integrate the detector with standard firewall technology to prevent the 
first occurrence of a worm from entering a secured network system. We do not ad- 
dress the correlation function and the mitigation strategies in this paper; rather we 
focus on the method of detection for anomalous payload. 

This approach can be applied to any network system, service or port for that site to 
compute its own “site-specific” payload anomaly detector, rather than being depend- 
ent upon others deploying a specific signature for a newly detected worm or exploit 
that has already damaged other sites. As an added benefit of the approach described 
in this paper, the method may also be used to detect encrypted channels which may 
indicate an unofficial secure tunnel is operating against policy.

The rest of the paper is organized as follows. Section 2 discusses related work in
network intrusion detection. In Section 3 we describe the model and the anomaly
detection technique. Section 4 presents the results and evaluations of the method
applied to different sets of data and its run-time performance. One of the datasets is
publicly available for other researchers to verify our results. Section 5 concludes the
paper.



2 Related Work 

There are two types of systems that are called anomaly detectors: those based upon a 
specification (or a set of rules) of what is regarded as “good/normal” behavior, and 
others that learn the behavior of a system under normal operation. The first type relies 
upon human expertise and may be regarded as a straightforward extension of typical 
misuse detection IDS systems. In this paper we regard the latter type, where the be- 
havior of a system is automatically learned, as a true anomaly detection system. 

Rule-based network intrusion detection systems such as Snort and Bro use
hand-crafted rules to identify known attacks, for example, virus signatures in the
application payload, and requests to nonexistent services or hosts. Anomaly detection
systems such as SPADE [5], NIDES [6], PHAD [13], ALAD [12] compute (statistical)
models for normal network traffic and generate alarms when there is a large deviation
from the normal model. These systems differ in the features extracted from available
audit data and the particular algorithms they use to compute the normal models. Most
use features extracted from the packet headers. SPADE, ALAD and NIDES model
the distribution of the source and destination IP and port addresses and the TCP
connection state. PHAD uses many more attributes, a total of 34, which are extracted
from the packet header fields of Ethernet, IP, TCP, UDP and ICMP packets.

Some systems use some payload features but in a very limited way. NATE is similar
to PHAD; it treats each of the first 48 bytes as a statistical feature starting from the
IP header, which means it can include at most the first 8 bytes of the payload of each 
network packet. ALAD models the incoming TCP request and includes as a feature 
the first word or token of each input line out of the first 1000 application payloads, 
restricted only to the header part for some protocols like HTTP and SMTP. 

The work of Kruegel et al. [8] describes a service-specific intrusion detection
system that is most similar to our work. They combine the type, length and payload
distribution of the request as features in a statistical model to compute an anomaly score
of a service request. However, they treat the payload in a very coarse way. They first
sort the 256 ASCII characters by frequency and aggregate them into 6 groups: 0, 1-3,
4-6, 7-11, 12-15, and 16-255, and compute one single uniform distribution model
of these 6 segments for all requests to one service over all possible length payloads.
They use a chi-square test against this model to calculate the anomaly score of new
requests. In contrast, we model the full byte distribution conditioned on the length of
payloads and use Mahalanobis distance as fully described in the following discussion.
Furthermore, the modeling we introduce includes automatic clustering of centroids
that is shown to increase accuracy and dramatically reduce resource consumption.
The method is fully general and does not require any parsing, discretization,
aggregation or tokenizing of the input stream (e.g., [14]).
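
To make the coarseness of the six-group aggregation concrete, the grouping as described above can be sketched like this. It follows only the description given in this section; the chi-square scoring step and any other details of the system in [8] are not reproduced, and the function name is hypothetical.

```python
# Rank ranges (after sorting byte values by descending frequency) for the six groups.
BINS = [(0, 0), (1, 3), (4, 6), (7, 11), (12, 15), (16, 255)]


def six_bin_distribution(payload):
    """Aggregate sorted byte frequencies into the six rank groups listed above."""
    counts = [0] * 256
    for b in payload:
        counts[b] += 1
    total = len(payload)
    if total == 0:
        return [0.0] * len(BINS)
    ranked = sorted(counts, reverse=True)  # most frequent byte value first
    return [sum(ranked[lo:hi + 1]) / total for lo, hi in BINS]
```

Because the byte values are sorted by frequency before grouping, two payloads with entirely different characters (for example "aaab" and "xxxy") yield the same six numbers, which illustrates how much of the byte distribution this representation discards.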

Network intrusion detection systems can also be classified according to the semantic
level of the data that is analyzed and modeled. Some of the systems reconstruct the
network packets and extract features that describe the higher level interactions
between end hosts like MADAM ID [9], Bro [15], EMERALD [18], STAT [24], ALAD
[13], etc. For example, session duration time, service type, bytes transferred, and so
forth are regarded as higher level, temporally ordered features not discernible by
inspecting only the packet content. Other systems are purely packet-based like PHAD
[14], NETAD [12], NATE [23]. They detect anomalies in network packets directly
without reconstruction. This approach has the important advantage of being simple
and fast to compute, and such systems are generally quite good at detecting those attacks
that do not result in valid connections or sessions, for example, scanning and probing
attacks.

3 Payload Modeling and Anomaly Detection 

There are many design choices in modeling payload in network flows. The primary 
design criteria and operating objectives of any anomaly detection system entail:

• automatic “hands-free” deployment requiring little or no human intervention, 

• generality for broad application to any service or system, 

• incremental update to accommodate changing or drifting environments, 

• accuracy in detecting truly anomalous events, here anomalous payload, with low 
(or controllable) false positive rates, 

• resistance to mimicry attack and 

• efficiency to operate in high bandwidth environments with little or no impact on 
throughput or latency. 

These are difficult objectives to meet concurrently, yet they do suggest an ap- 
proach that may balance these competing criteria for payload anomaly detection. 

We chose to consider “language-independent” statistical modeling of sampled data 
streams best exemplified by well known n-gram analysis. Many have explored the 
use of n-grams in a variety of tasks. The method is well understood, efficient and 
effective. The simplest model one can compose is the 1-gram model. A 1-gram model 
is certainly efficient (requiring a linear time scan of the data stream and an update of 
a small 256-element histogram) but whether it is accurate requires analysis and
experimentation. To our surprise, this technique has worked well in our
experiments as we shall describe in Section 4. Furthermore, the method is indeed
resistant to mimicry attack. Mimicry attacks are possible if the attacker has access to 
the same information as the victim to replicate normal behavior. In the case of appli- 
cation payload, attackers (including worms) would not know the distribution of the 
normal flow to their intended victim. The attacker would need to sniff for a long pe- 
riod of time and analyze the traffic in the same fashion as the detector described 
herein, and would also then need to figure out how to pad their poison payload to 
mimic the normal model.

3.1 Length Conditioned n-Gram Payload Model

Network payload is just a stream of bytes. Unlike the network packet headers, pay- 
load doesn’t have a fixed format, small set of keywords or expected tokens, or a lim- 
ited range of values. Any character or byte value may appear at any position of the 
datagram stream. To model the payload, we need to divide the stream into smaller 
clusters or groups according to some criteria to associate similar streams for model- 
ing. The port number and the length are two obvious choices. We may also condition 
the models on the direction of the stream, thus producing separate models for the 
inbound traffic and outbound responses. 

Usually the standard network services have a fixed pre-assigned port number: 20 
for FTP data transmission, 21 for FTP commands, 22 for SSH, 23 for Telnet, 25 for
SMTP, 80 for Web, etc. Each such application has its own special protocol and thus 
has its own payload type. Each site running these services would have its own “typi- 
cal payload” flowing over these services. Payload to port 22 should be encrypted and 
appear as uniform distribution of byte values, while the payload to port 21 should be 
primarily printable characters entered by a user at a keyboard.

Within one port, the payload length also varies over a large range. The most com- 
mon TCP packets have payload lengths from 0 to 1460. Different length ranges have 
different types of payload. The larger payloads are more likely to have non-printable 
characters indicative of media formats and binary representations (pictures, video 
clips or executable files etc.). Thus, we compute a payload model for each different 
length range for each port and service and for each direction of payload flow. This 
produces a far more accurate characterization of the normal payload than would oth- 
erwise be possible by computing a single model for all traffic going to the host. How- 
ever, many centroids might be computed for each possible length payload creating a 
detector with a large resource consumption. 

To keep our model simple and quick to compute, we model the payload using
n-gram analysis, and in particular the byte value distribution, that is, the case n=1. An
n-gram is the sequence of n adjacent bytes in a payload unit. A sliding window with
width n is passed over the whole payload and the occurrence of each n-gram is 
counted. N-gram analysis was first introduced by [2] and exploited in many language 
analysis tasks, as well as security tasks. The seminal work of Forrest [3] on system 
call traces uses a form of n-gram analysis (without the frequency distribution and 
allowing for “wildcards” in the gram) to detect malware execution as uncharacteristic 
sequences of system calls. 
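
The sliding-window count just described can be written in a few lines. This is a minimal sketch; the function name `ngram_counts` is introduced here for illustration.

```python
from collections import Counter


def ngram_counts(payload, n=1):
    """Count every n-gram seen by a width-n window slid over the payload."""
    # A payload of length L contains L - n + 1 windows (none if L < n).
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))
```

With n=1 this reduces to a histogram over at most 256 byte values, which is the case used in the rest of this section.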

For a payload, the feature vector is the relative frequency count of each n-gram 
which is calculated by dividing the number of occurrences of each n-gram by the total 
number of n-grams. The simplest case of a 1-gram computes the average frequency of 
each ASCII character 0-255. Some stable character frequencies and some very variant 
character frequencies can result in the same average frequency, but they should be 
characterized very differently in the model. Thus, we compute in addition to the mean 
value, the variance and standard deviation of each frequency as another characterizing
feature. So for the payload of a fixed length of some port, we treat each character’s
relative frequency as a variable and compute its mean and standard deviation as
the payload model. 
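The 1-gram model described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `byte_frequencies` and `train_model` are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
from collections import Counter
from statistics import mean, pstdev


def byte_frequencies(payload: bytes) -> list[float]:
    """Relative frequency of each byte value 0-255 in one payload (1-gram)."""
    if not payload:
        raise ValueError("payload must be non-empty")
    counts = Counter(payload)  # iterating over bytes yields ints 0-255
    total = len(payload)
    return [counts.get(b, 0) / total for b in range(256)]


def train_model(payloads: list[bytes]) -> tuple[list[float], list[float]]:
    """Per-byte mean and standard deviation of the relative frequencies.

    Assumption: population standard deviation (pstdev) is used.
    """
    vectors = [byte_frequencies(p) for p in payloads]
    columns = list(zip(*vectors))  # one column per byte value
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds
```

In this sketch one such (mean, standard deviation) pair would be trained per port, direction, and payload length.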
Figure 1 provides an example showing how the payload byte distributions vary
from port to port, and from source and destination flows. Each plot represents the 
characteristic profile for that port and flow direction (inbound/outbound). Notice also 
that the distributions for ports 22 (inbound and outbound) show no discernible pat- 
tern, and hence the statistical distribution for such encrypted channels would entail a 
more uniform frequency distribution across all of the 256 byte values, each with low 
variance. Hence, encrypted channels are fairly easy to spot. Notice that this figure is 
actually generated from a dataset with only the first 96 bytes of payload in each 
packet, and there is already a very clear pattern with the truncated payload. Figure 2 
displays the variability of the frequency distributions among different length pay- 
loads. The two plots characterize two different distributions from the incoming traffic 
to the same web server, port 80 for two different lengths, here payloads of 200 bytes, 
the other 1,460 bytes. Clearly, a single monolithic model for both length categories 
will not represent the distributions accurately. 




Fig. 1. Example byte distributions for
different ports. For each plot, the X-axis
is the ASCII byte 0-255, and the Y-axis is
the average byte frequency



Fig. 2. Example byte distribution for 
different payload lengths for port 80 on the 
same host server 



Given a training data set, we compute a set of models $M_{ij}$. For each specific observed
length i of each port j, $M_{ij}$ stores the average byte frequency and the standard
deviation of each byte’s frequency. The combination of the mean and variance of 
each byte’s frequency can characterize the payload within some range of payload 
lengths. So if there are 5 ports, and each port’s payload has 10 different lengths, there 
will be in total 50 centroid models computed after training. As an example, we show 
the model computed for the payload of length 185 for port 80 in figure 3, which is 
derived from a dataset described in Section 4. (We also provide an automated means 
of reducing the number of centroids via clustering as described in section 3.4.) 

PAYL operates as follows. We first observe many exemplar payloads during a 
training phase and compute the mean and variance of the byte value distribution, producing
model $M_{ij}$. During detection, each incoming payload is scanned and its byte
value distribution is computed. This new payload distribution is then compared
against model $M_{ij}$; if the distribution of the new payload is significantly different from
the norm, the detector flags the packet as anomalous and generates an alert.

The means to compare the two distributions, the model and the new payload, is de- 
scribed next. 




Fig. 3. The average relative frequency of each byte, and the standard deviation of the frequency 
of each byte, for payload length 185 of port 80 



3.2 Simplified Mahalanobis Distance 

Mahalanobis distance is a standard distance metric to compare two statistical distribu- 
tions. It is a very useful way to measure the similarity between the (unknown) new 
payload sample and the previously computed model. Here we compute the distance 
between the byte distributions of the newly observed payload against the profile from 
the model computed for the corresponding length range. The higher the distance 
score, the more likely this payload is abnormal. 

The formula for the Mahalanobis distance is: 

$$ d^2(x, y) = (x - y)^T C^{-1} (x - y) $$

where $x$ and $y$ are two feature vectors, and each element of the vector is a variable.
$x$ is the feature vector of the new observation, and $y$ is the averaged feature vector
computed from the training examples, each of which is a vector. $C^{-1}$ is the inverse
covariance matrix, with $C_{ij} = \mathrm{Cov}(y_i, y_j)$, where $y_i$ and $y_j$ are the ith and jth elements of
the training vector.

The advantage of Mahalanobis distance is that it takes into account not only the 
average value but also its variance and the covariance of the variables measured. 
Instead of simply computing the distance from the mean values, it weights each variable
by its standard deviation and covariance, so the computed value gives a statistical
measure of how well the new example matches (or is consistent with) the training
samples. 

In our problem, we use the “naive” assumption that the bytes are statistically inde- 
pendent. Thus, the covariance matrix C becomes diagonal and the elements along the 
diagonal are just the variance of each byte. 
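As a worked step (a sketch only, writing $\sigma_i^2$ for the variance of byte value $i$, a symbol assumed here for illustration), a diagonal covariance matrix reduces the quadratic form to a weighted sum of squared differences:

$$ d^2(x, y) = (x - y)^T C^{-1} (x - y) = \sum_{i=0}^{n-1} \frac{(x_i - y_i)^2}{\sigma_i^2}, \qquad C = \mathrm{diag}(\sigma_0^2, \ldots, \sigma_{n-1}^2) $$

Each byte value then contributes independently, so no matrix inversion is needed.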
Notice, when computing the Mahalanobis distance, we pay the price of having to
compute multiplications and square roots after summing the differences across the
byte value frequencies. To further speed up the computation, we derive the simplified
Mahalanobis distance:

$$ d(x, y) = \sum_{i=0}^{n-1} \frac{|x_i - y_i|}{\sigma_i} $$

where the variance is replaced by the standard deviation. Here n is fixed to 256 under
the 1-gram model (since there are only 256 possible byte values). Thus, we avoid the 
time-consuming square and square-root computations (in favor of a single division 
operation) and now the whole computation time is linear in the length of the payload 
with a small constant to compute the measure. This produces an exceptionally fast 
detector (recall our objective to operate in high-bandwidth environments). 

For the simplified Mahalanobis distance, there is the possibility that the standard
deviation $\sigma_i$ equals zero and the distance will become infinite. This will happen
when a character or byte value never appears in the training samples or, oddly
enough, it appears with exactly the same frequency in each sample. To avoid this
situation, we give a smoothing factor $\alpha$ to the standard deviation similar to the prior
observation:

$$ d(x, y) = \sum_{i=0}^{n-1} \frac{|x_i - y_i|}{\sigma_i + \alpha} $$

The smoothing factor $\alpha$ reflects the statistical confidence of the sampled training
data. The larger the value of $\alpha$, the less the confidence that the samples are truly
representative of the actual distribution, and thus the byte distribution can be more
variable. Over time, as more samples are observed in training, $\alpha$ may be decremented
automatically.

The formula for the simplified Mahalanobis distance also suggests how to set the 
threshold to detect anomalies. If we set the threshold to 256, this means we allow 
each character to have a fluctuation range of one standard deviation from its mean. 
Thus, logically we may adjust the threshold to a value in increments of 128 or 256, 
which may be implemented as an automatic self-calibration process. 
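A minimal sketch of the smoothed distance and the threshold test follows. The function names are hypothetical and the default values simply mirror numbers quoted in the text (smoothing factor 0.001, threshold 256); it is an illustration under those assumptions rather than the authors' code.

```python
def simplified_mahalanobis(x, mean, std, alpha=0.001):
    """Sum over byte values of |x_i - y_i| / (sigma_i + alpha).

    x    : relative byte frequencies of the new payload
    mean : per-byte mean frequencies from training
    std  : per-byte standard deviations from training
    alpha: smoothing factor that keeps the denominator non-zero
    """
    if not (len(x) == len(mean) == len(std)):
        raise ValueError("x, mean and std must have the same length")
    return sum(abs(xi - mi) / (si + alpha) for xi, mi, si in zip(x, mean, std))


def is_anomalous(x, mean, std, threshold=256.0, alpha=0.001):
    """Flag a payload whose distance exceeds the threshold."""
    return simplified_mahalanobis(x, mean, std, alpha) > threshold
```

With 256 byte values, a payload that deviates from the mean by exactly one standard deviation in every byte value scores 256 when the smoothing factor is zero, which is the reasoning behind the threshold of 256 mentioned above.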

3.3 Incremental Learning 

The 1-gram model with Mahalanobis distance is very easy to implement as an incre- 
mental version with only slightly more information stored in each model. An incre- 
mental version of this method is particularly useful for several reasons. A model may 
be computed on the fly in a “hands-free” automatic fashion. That model will improve 
in accuracy as time moves forward and more data is sampled. Furthermore, an incre- 
mental online version may also “age out” old data from the model keeping a more 
accurate view of the most recent payloads flowing to or from a service. This “drift in 
environment” can be solved via incremental or online learning [25]. 

To age out older examples used in training the model, we can specify a decay pa- 
rameter of the older model and emphasize the frequency distributions appearing in 
the new samples. This provides the means of automatically updating the model to 
maintain an accurate view of normal payloads seen most recently. 
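The text does not give the exact decay rule, so the following is only one plausible reading: an exponentially weighted update in which a hypothetical `decay` parameter sets how strongly a new sample outweighs the older model. The names `decayed_update` and `mean_sq` (the running average of squared frequencies) are assumptions made for illustration.

```python
def decayed_update(mean, mean_sq, x, decay=0.05):
    """Blend a new frequency vector x into the stored averages.

    Assumption: exponential weighting, where `decay` in (0, 1] is the
    weight of the new sample and (1 - decay) the weight of the old model.
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    new_mean = [(1 - decay) * m + decay * xi for m, xi in zip(mean, x)]
    new_mean_sq = [(1 - decay) * q + decay * xi * xi for q, xi in zip(mean_sq, x)]
    return new_mean, new_mean_sq
```

A larger `decay` forgets old traffic faster; a smaller value keeps the model more stable.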
To compute the incremental version of the Mahalanobis distance, we need to compute
the mean and the standard deviation of each ASCII character seen for each new
sample observed. For the mean frequency of a character, we compute
$\bar{x} = \sum_{i=1}^{N} x_i / N$ from the training examples. If we also store the number of samples
processed, $N$, we can update the mean as

$$ \bar{x} = \frac{\bar{x} \times N + x_{N+1}}{N+1} = \bar{x} + \frac{x_{N+1} - \bar{x}}{N+1} $$

when we see a new example $x_{N+1}$, a clever update technique described by Knuth [7].

Since the standard deviation is the square root of the variance, the variance compu- 
tation can be rewritten using the expected value E as: 

$$ \mathrm{Var}(X) = E(X - EX)^2 = E(X^2) - (EX)^2 $$

We can update the standard deviation in a similar way if we also store the average
of the $x_i^2$ in the model.

This requires maintaining only one more 256-element array in each model that
stores the average of the $x_i^2$ and the total number of observations N. Thus, the n-gram
byte distribution model can be implemented as an incremental learning system easily
and very efficiently. Maintaining this extra information can also be used in clustering 
samples as described in the next section. 
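The incremental update can be sketched as a small class that stores the sample count, the running mean, and the running mean of squares. The class name `IncrementalModel` and its attribute names are hypothetical; the update rule is the one given in the text.

```python
import math


class IncrementalModel:
    """Running per-byte mean and mean-of-squares, updated one sample at a time."""

    def __init__(self, size=256):
        self.n = 0                    # number of samples seen so far (N)
        self.mean = [0.0] * size      # running average of x_i
        self.mean_sq = [0.0] * size   # running average of x_i squared

    def update(self, x):
        """Fold one relative-frequency vector into the model."""
        if len(x) != len(self.mean):
            raise ValueError("vector length does not match the model")
        self.n += 1
        for i, xi in enumerate(x):
            # mean <- mean + (x_new - mean) / (N + 1)
            self.mean[i] += (xi - self.mean[i]) / self.n
            self.mean_sq[i] += (xi * xi - self.mean_sq[i]) / self.n

    def std(self):
        """Standard deviation from Var(X) = E(X^2) - (EX)^2."""
        # max(..., 0.0) guards against tiny negative values from rounding.
        return [math.sqrt(max(q - m * m, 0.0))
                for m, q in zip(self.mean, self.mean_sq)]
```

Only two arrays and a counter are kept per model, so no training payload has to be retained.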

3.4 Reduced Model Size by Clustering 

When we described our model, we said we compute one model $M_{ij}$ for each observed
length bin i of payloads sent to port j. Such fine-grained modeling might introduce
several problems. First, the total size of the model can become very large.
(The payload lengths are associated with media files that may be measured in giga- 
bytes and many length bins may be defined causing a large number of centroids to be 
computed.) Further, the byte distribution for payloads of length bin i can be very 
similar to that of payloads of length bins i-1 and i+1; after all they vary by one byte. 
Storing a model for each length may therefore be redundant and wasteful.

Another problem is that for some length bins, there may not be enough training 
samples. Sparseness implies the data will generate an empirical distribution that will 
be an inaccurate estimate of the true distribution leading to a faulty detector. 

There are two possible solutions to these problems. One solution for the sparseness 
problem is relaxing the models by assigning a higher smoothing factor to the standard 
deviations which allows higher variability of the payloads. The other solution is to 
“borrow” data from neighboring bins to increase the number of samples; i.e. we use 
data from neighboring bins used to compute other “similar” models. 

We compare two neighboring models using the simple Manhattan distance to 
measure the similarity of their average byte frequency distributions. If their distance 
is smaller than some threshold t, we merge those two models. This clustering technique
is repeated until no more neighboring models can be merged. This merging is
easily computed using the incremental algorithm described in Section 3.3; we update 
the means and variances of the two models to produce a new updated distribution. 
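The merging step can be sketched as below. The dictionary layout (`lengths`, `n`, `mean`, `mean_sq`) and the function names are hypothetical, and weighting each model by its sample count when merging is an assumption consistent with the incremental update of Section 3.3, not a detail stated in the text.

```python
def manhattan(a, b):
    """Manhattan distance between two mean byte-frequency vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def merge(m1, m2):
    """Combine two models, weighting each by its number of samples."""
    n = m1["n"] + m2["n"]

    def blend(u, v):
        return [(m1["n"] * a + m2["n"] * b) / n for a, b in zip(u, v)]

    return {
        "lengths": (m1["lengths"][0], m2["lengths"][1]),  # merged length range
        "n": n,
        "mean": blend(m1["mean"], m2["mean"]),
        "mean_sq": blend(m1["mean_sq"], m2["mean_sq"]),
    }


def cluster_neighbors(models, t):
    """Merge adjacent length bins until no neighboring pair is closer than t.

    `models` must be ordered by payload length.
    """
    models = list(models)
    merged = True
    while merged:
        merged = False
        i = 0
        while i < len(models) - 1:
            if manhattan(models[i]["mean"], models[i + 1]["mean"]) < t:
                models[i:i + 2] = [merge(models[i], models[i + 1])]
                merged = True   # re-check the merged model against its neighbor
            else:
                i += 1
    return models
```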
Now for new observed test data with length i sent to port j, we use the model $M_{ij}$,
or the model it was merged with. But there is still the possibility that the length of the 
test data is outside the range of all the computed models. For such test data, we use 
the model whose length range is nearest to that of the test data. In these cases, the 
mere fact that the payload has such an unusual length unobserved during training may 
itself be cause to generate an alert. 
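Model selection for a test payload can be sketched as a lookup over length ranges with a nearest-range fallback. The function name `select_model` and the `(lo, hi)` range representation are hypothetical; the returned flag lets a caller raise a separate alert when the length was never seen in training.

```python
def select_model(bins, length):
    """Pick the model for a payload length.

    bins  : list of ((lo, hi), model) pairs with inclusive length ranges
    length: payload length of the test data
    Returns (model, in_range); in_range is False when the nearest range
    had to be used because no range covers the length.
    """
    if not bins:
        raise ValueError("no models available")
    for (lo, hi), model in bins:
        if lo <= length <= hi:
            return model, True

    def gap(entry):
        (lo, hi), _ = entry
        return lo - length if length < lo else length - hi

    _, model = min(bins, key=gap)
    return model, False
```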

The reader should note that the modeling algorithm and the model merging process 
are each linear time computations, and hence the modeling technique is very fast and 
can be performed in real time. The online learning algorithm also assures us that 
models will improve over time, and their accuracy will be maintained even when 
services are changed and new payloads are observed. 

3.5 Unsupervised Learning 

Our model together with Mahalanobis distance can also be applied as an unsuper- 
vised learning algorithm. Thus, training the models is possible even if noise is present 
in the training data (for example, if training samples include payloads from past 
worm propagations still propagating on the internet.) This is based on the assumption 
that the anomalous payloads are a minority of the training data and their payload distribution
is different from that of the normal payloads. These abnormal payloads can be identified
in the training set and their distributions removed from the model. This is
accomplished by applying the learned models to the training dataset to detect outliers.
Those anomalous payloads will have a much larger distance to the profile than the 
“average” normal samples and thus will likely appear as statistical outliers. After 
identifying these anomalous training samples, we can either remove the outliers and 
retrain the models, or update the frequency distributions of the computed models by 
removing the counts of the byte frequencies appearing in the anomalous training data. 
We demonstrate the effectiveness of these techniques in the evaluation section. 
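The first option (remove the outliers and retrain) can be sketched as follows. All names are hypothetical, the outlier threshold is a free parameter chosen by the caller, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def freq(payload: bytes) -> list[float]:
    """Relative frequency of each byte value 0-255 (payload must be non-empty)."""
    return [payload.count(bytes([b])) / len(payload) for b in range(256)]


def fit(vectors):
    """Per-byte mean and (population) standard deviation."""
    cols = list(zip(*vectors))
    return [mean(c) for c in cols], [pstdev(c) for c in cols]


def distance(x, means, stds, alpha=0.001):
    """Simplified Mahalanobis distance with smoothing factor alpha."""
    return sum(abs(a - m) / (s + alpha) for a, m, s in zip(x, means, stds))


def train_without_outliers(payloads, threshold, alpha=0.001):
    """Train, drop training samples scoring above threshold, then retrain.

    Returns ((means, stds), number_of_removed_samples).
    """
    vectors = [freq(p) for p in payloads]
    means, stds = fit(vectors)
    kept = [v for v in vectors if distance(v, means, stds, alpha) <= threshold]
    if not kept:
        raise ValueError("threshold removed every training sample")
    return fit(kept), len(vectors) - len(kept)
```

The threshold has to be tuned per data set: with very few samples a single outlier inflates the standard deviations and its own score stays bounded.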

3.6 Z-String 

Consider the string of bytes corresponding to the sorted, rank ordered byte frequency 
of a model. Figure 4 displays a view of this process. The frequency distribution of 
payloads of length 185 is plotted in the top graph. The lower graph represents the 
same information, but the plot is reordered to the rank ordering of the distribution.
Here, the first bar in the lower plot is the frequency of the most frequently appearing 
ASCII character. The second bar is likewise the second most frequent, and so on. 
This rank ordered distribution surprisingly follows a Zipf-like distribution (an expo- 
nential function or a power law where there are few values appearing many times, 
and a large number of values appearing very infrequently.) 

The rank order distribution also defines what we call a “Z-string”. The byte values 
ordered from most frequent to least frequent serve as a representative of the entire
distribution. Figure 5 displays the Z-String for the plot in Figure 4. Notice that for 
this distribution there are only 83 distinct byte values appearing in the distribution. 
Thus, the Z-string has length 83. 
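Computing a Z-string from a model's mean byte frequencies can be sketched as below. The function name `z_string` is hypothetical, and breaking ties by byte value is an assumption, since the text does not say how equal frequencies are ordered.

```python
def z_string(mean_freq) -> bytes:
    """Byte values with non-zero frequency, most frequent first.

    Assumption: ties are broken by ascending byte value.
    """
    if len(mean_freq) != 256:
        raise ValueError("expected one frequency per byte value 0-255")
    ranked = sorted(
        (b for b in range(256) if mean_freq[b] > 0),
        key=lambda b: (-mean_freq[b], b),
    )
    return bytes(ranked)
```

Byte values that never appear are omitted, which is why the Z-string of the distribution above has length 83 rather than 256.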
Furthermore, as we shall see later, this rank ordered byte value distribution of the
new payload deemed anomalous also may serve as a simple representation of a “new 
worm signature” that may be rapidly deployed to other sites to better detect the ap- 
pearance of a new worm at those sites; if an anomalous payload appears at those sites 
and its rank ordered byte distribution matches a Z-string provided from another site, 
the evidence is very good that a worm has appeared. This distribution mechanism is 
part of an ongoing project called “Worminator” [11, 22] that implements a “collabo- 
rative security” system on the internet. A full treatment of this work is beyond the 
scope of this paper, but the interested reader is encouraged to visit
http://worminator.cs.columbia.edu/ for details.




Fig. 4. Payload distribution appears in the top plot, re-ordered to the rank-ordered count frequency
distribution in the bottom plot. Notice there are only 83 distinct characters used in the
average payload for this service (port 80, http) for this length distribution of payloads (all payloads
with length 185 bytes)



eto.c/a CxP lsrw:imnTupghhH|- 
α : LF - Line feed    β : CR - Carriage return

Fig. 5. The signature “Z-string” for the average payload displayed in Figure 4. “e” is the most 
frequent byte value, followed by “t” and so on. Notice how balanced characters appear adjacent
to each other, for example “()” and “[]” since these tend to appear with equal frequency 



4 Evaluation of the 1-Gram Models 

We conducted two sets of experiments to test the effectiveness of the 1-gram models. 
The first experiment was applied to the 1999 DARPA IDS Data Set which is the most 
complete dataset with full payload publicly available for experimental use. The experiment
here can be repeated by anyone using this data set to verify the results we
report. The second experiment used the CUCS dataset which is the inbound network 
traffic to the web server of the computer science department of Columbia University. 
Unfortunately, this dataset cannot be shared with other researchers due to the privacy 
policies of the university. (In fact, the dataset has been erased to avoid a breach of 
anyone’s privacy.) 

4.1 Experiments with 1999 DARPA IDS Data Set 

The 1999 DARPA IDS data set was collected at MIT Lincoln Labs to evaluate intru- 
sion detection systems. All the network traffic including the entire payload of each 
packet was recorded in tcpdump format and provided for evaluation. In addition, 
there are also audit logs, daily file system dumps, and BSM (Solaris system call) logs. 
The data consists of three weeks of training data and two weeks of test data. In the 
training data there are two weeks of attack-free data and one week of data with la- 
beled attacks. 

This dataset has been used in many research efforts and results of tests against this 
data have been reported in many publications. Although there are problems due to 
the nature of the simulation environment that created the data, it still remains a useful 
set of data to compare techniques. The top results were reported by [10]. 

In our experiment on payload anomaly detection we only used the inside network 
traffic data which was captured between the router and the victims. Because most 
public applications on the Internet use TCP (web, email, telnet, and ftp), and to reduce
the complexity of the experiment, we only examined the inbound TCP traffic to
ports 0-1023 of the hosts 172.016.xxx.xxx. These hosts include most of the victims,
and ports 0-1023 cover the majority of the network services. For the DARPA
99 data, we conducted experiments using each packet as the data unit and, separately, each
connection as the data unit. We used tcptrace to reconstruct the TCP connections from
the network packets in the tcpdump files. We also experimented with the idea of “truncated
payload”, both for each packet and each connection. For truncated packets, we
tried the first N bytes and the tail N bytes separately, where N is a parameter. Using 
truncated payload saves considerable computation time and space. We report the 
results for each of these models. 
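Truncation itself is simple slicing. The helper names below are hypothetical, and N is the parameter mentioned in the text.

```python
def first_bytes(payload: bytes, n: int) -> bytes:
    """First n bytes of a payload (the whole payload if it is shorter)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return payload[:n]


def tail_bytes(payload: bytes, n: int) -> bytes:
    """Last n bytes of a payload (the whole payload if it is shorter)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # payload[-0:] would return everything, so n == 0 is handled explicitly.
    return payload[-n:] if n > 0 else b""
```

The truncated bytes are then fed to the same byte-frequency model as a full payload would be.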

We trained the payload distribution model on the DARPA dataset using week 1 (5 
days, attack free) and week 3 (7 days, attack free), then evaluated the detector on
weeks 4 and 5, which contain 201 instances of 58 different attacks, 177 of which are 
visible in the inside tcpdump data. Because we restrict the victims’ IP and port range, 
there are 14 others we ignore in this test. 

In this experiment, we focus on TCP traffic only, so the attacks using UDP, ICMP, 
ARP (address resolution protocol) and IP only cannot be detected. They include: 
smurf (ICMP echo-reply flood), ping-of-death (over-sized ping packets), UDPstorm, 
arppoison (corrupts ARP cache entries of the victim), selfping, ipsweep, teardrop 
(mis-fragmented UDP packets). Also because our payload model is computed from 
only the payload part of the network packet, those attacks that do not contain any 
payload are impossible to detect with the proposed anomaly detector. Thus, there are
in total 97 attacks to be detected by our payload model in weeks 4 and 5 evaluation 
data. 

After filtering there are in total 2,444,591 packets, and 49,556 connections, with
non-zero length payloads to evaluate. We build a model for each payload length ob- 
served in the training data for each port between 0-1023 and for every host machine. 
The smoothing factor is set to 0.001 which gives the best result for this dataset (see 
the discussion in Section 3.2). This helps avoid over-fitting and reduces the false 
positive rate. Also due to having an inadequate number of training examples in the 
DARPA99 data, we apply clustering to the models as described previously. Cluster- 
ing the models of neighboring length bins means that similar models can provide 
more training data for a model whose training data is too sparse thus making it less 
sensitive and more accurate. But there is also the risk that the detection rate will be 
lower when the model allows more variance in the frequency distributions. Starting from
the models for each payload length, we clustered with a threshold of 0.5: if the
byte frequency distributions of two neighboring models are less than 0.5 apart in
Manhattan distance, we merge the models. We experimented with both unclustered
and clustered models. The results indicate that the clustered model is always better 
than the unclustered model. So in this paper, we will only show the results of the 
clustered models. 

Different port traffic has different byte variability. For example, the payload to
port 80 (HTTP requests) is usually less variable than that of port 25 (email). Hence,
we set different thresholds for each port and check the detector’s performance for 
each port. The attacks used in the evaluation may target one or more ports. Hence, we 
calibrate a distinct threshold for each port and generate the ROC curves including all 
appropriate attacks as ground truth. The packets with distance scores higher than the 
threshold are detected as anomalies. 
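Per-port calibration can be sketched as a sweep over candidate thresholds. This simplified illustration counts rates per scored unit (for example per packet), whereas the evaluation in the paper counts detected attacks; the function names and the 1% default are assumptions for illustration.

```python
def roc_points(scores, labels, thresholds):
    """(threshold, false positive rate, detection rate) for each threshold.

    scores: distance score per data unit
    labels: True where the unit belongs to an attack, else False
    """
    pos = sum(1 for l in labels if l)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one attack and one normal sample")
    points = []
    for t in thresholds:
        tp = sum(1 for s, l in zip(scores, labels) if l and s > t)
        fp = sum(1 for s, l in zip(scores, labels) if not l and s > t)
        points.append((t, fp / neg, tp / pos))
    return points


def calibrate(scores, labels, thresholds, max_fp_rate=0.01):
    """Smallest threshold whose false positive rate is at most max_fp_rate."""
    ok = [t for t, fpr, _ in roc_points(scores, labels, thresholds)
          if fpr <= max_fp_rate]
    return min(ok) if ok else None
```

Running this once per port yields a distinct threshold for each service, as described above.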

Figure 6 shows the ROC curves for the four most commonly attacked ports: 21, 
23, 25, and 80. For the other ports, e.g. 53, 143, 513 etc., the DARPA99 data doesn’t
provide a large enough training and testing sample, so the results for those ports are 
not very meaningful. 

For each port, we used five different data units, for both training and testing. The 
legend in the plots and their meaning are: 

1) Per Packet Model, which uses the whole payload of each network packet; 

2) First 100 Packet Model, which uses the first 100 bytes of each network packet; 

3) Tail 100 Packet Model, which uses the last 100 bytes of each network packet; 

4) Per Conn Model, which uses the whole payload of each connection; 

5) Truncated Conn Model, which uses the first 1000 bytes of each connection. 

From Figure 6 we can see that the payload-based model is very good at detecting 
the attacks to port 21 and port 80. For port 21, the attackers often first upload some 
malicious code onto the victim machine and then login to crash the machine or get 
root access, like casesen and sechole. The test data also includes attacks that up- 
load/download illegal copies of software, like warezmaster and warezclient. These 
attacks were detected easily because their content was rarely seen executable
code, quite different from the common files going through FTP. For port 80,
the attacks are often malformed HTTP requests and are very different from normal
requests. For instance, crashiis sends a malformed “GET” request, and apache2 sends a request with
a lot of repeated “User-Agent:sioux\r\n” lines, etc. Using payload to detect these attacks is
a more reliable means than detecting anomalous headers simply because their packet 
headers are all normal to establish a good connection to deliver their poison payload. 
Connection based detection has a better result than the packet based models for port 
21 and 80. It’s also important to notice that the truncated payload models achieve 
results nearly as good as the full payload models, but are much more efficient in time 
and space. 




Fig. 6. ROC curves for ports 21, 23, 25, 80 for the five different models (x-axis: false positive rate in %). Notice the x-axis scale
is different for each plot and does not span to 100%, but limited to the worst false positive rate
for each plot



For port 23 and port 25 the result is not as good as the models for port 21 and 80.
That is because their content is quite free-form and some of the attacks are well hidden.
For example, the framespoofer attack is a fake email from the attacker that misdirects
the victim to a malicious web site. The website URL looks entirely normal.
Malformed email and telnet sessions are successfully detected, like the perl attack 
which runs some bad perl commands in telnet, and the sendmail attack which is a 
carefully crafted email message with an inappropriately large MIME header that 
exploits a buffer overflow error in some versions of the sendmail program. For these 
two ports, the packet-based models are better than the connection-based models. This 
is likely due to the fact that the actual exploit is buried within the larger context of the
entire connection data, and its particular anomalous character distribution is swamped 
by the statistics of the other data portions of the connection. The per packet model 
detects this anomalous payload more easily. 

There are many attacks that involve multiple steps aimed at multiple ports. If we 
can detect one of the steps at any one port, then the attack can be detected success- 
fully. Thus we correlate the detector alerts from all the ports and plot the overall per- 
formance. When we restrict the false positive rate of each port (during calibration of 
the threshold) to be lower than 1%, we achieve about a 60% detection rate, which is
relatively high for the DARPA99 dataset. The results for each model are displayed in
Table 1:

Table 1. Overall detection rate of each model when false positive rate lower than 1%

Per Packet Model          57/97 (58.8%)
First 100 Packet Model    55/97 (56.7%)
Tail 100 Packet Model     46/97 (47.4%)
Per Conn Model            55/97 (56.7%)
Truncated Conn Model      51/97 (52.6%)



Modeling the payload to detect anomalies is useful to protect servers against new 
attacks. Furthermore, careful inspection of the detected attacks in the tables and from 
other sources reveals that correlating this payload detector with other detectors in- 
creases the coverage of the attack space. There is large non-overlap between the at- 
tacks detected via payload and other systems that have reported results for this same 
dataset, for example PHAD [13]. This is obvious because the data sources and model- 
ing used are totally different. PHAD models packet header data, whereas payload 
content is modeled here. 

Our payload-based model has small memory consumption and is very efficient to 
compute. Table 2 displays the measurements of the speed and the resulting number of 
centroids for each of the models for both cases of unclustered and clustered. The 
results were derived from measuring PAYL on a 3GHz P4 Linux machine with 2G 
memory using non-optimized Java code. These results do not indicate how well a 
professionally engineered system may behave (re-engineering in C probably would 
gain a factor of 6 or more in speed). Rather, these results are provided to show the 
relative efficiency among the alternative modeling methods. The training and test 
time reported in the table is seconds per 100M of data, which includes the I/O time.
The number of centroids computed after training represents an approximation of the 
total amount of memory consumed by each model. Notice that each centroid has 
fixed size: two 256-element double arrays, one for storing averages and the other for 
storing the standard deviation of the 256 ASCII bytes. A re-engineered version of 
PAYL would not consume as much space as does a Java byte stream object. From the 
table we can see that clustering reduces the number of centroids, and thus the total consumed
memory, by a factor of roughly 2 to 16 with little or no hit in computational performance.
Combining Figure 6, Table 1 and Table 2, users can choose the proper model
for their application according to their environment and performance requirements. 
Table 2. Speed and memory measurements of each model. The training and testing time is in
units of seconds per 100M of data, including the I/O time. The memory consumption is measured
in the number of centroids that were kept after clustering or learning

Unclustered/Clustered    Per Packet   First 100   Tail 100   Per Conn.   Trunc Conn.
Train time (uncl)        26.1         21.8        21.8       8.6         4.4
Test time (uncl)         16.1         9.4         9.4        9.6         1.6
No. centroid (uncl)      11583        11583       11583      16326       16326
Train time (clust)       26.2         22.0        26.2       8.8         4.6
Test time (clust)        16.1         9.4         9.4        9.6         1.6
No. centroid (clust)     4065         7218        6126       2219        1065



This result is surprisingly good for such a simple modeling technique. Most impor- 
tantly, this anomaly detector can easily augment existing detection systems. It is not 
intended as a stand alone detection system but a component in a larger system aiming 
for defense in depth. Hence, the detector would provide additional and useful alert 
information to correlate with other detectors that in combination may generate an 
alarm and initiate a mitigation process. The DARPA 99 dataset was used here so 
that others can verify our results. However, we also performed experiments on a live 
stream that we describe next. 



4.2 Experiments with CUCS Dataset 

The CUCS dataset is the Columbia University CS web server dataset, which consists of
two traces of incoming traffic with full payload to the CS department web server
(www.cs.columbia.edu). The two traces were collected separately, one in August 
2003 for 45 hours with a size of about 2GB, and one in September 2003 for 24 hours 
with size 1GB. We denote the first one as A, the second one as B, and their union as 
AB. Because we did not know whether this dataset is attack-free or not, this experi- 
ment represents an unlabeled dataset that provides the means of testing and evaluat- 
ing the unsupervised training of the models. 



Table 3. Unsupervised learning result on CUCS dataset

Train   Test   Anomaly #        CR-II   Buffer
A       A      28 (0.0084%)
A       B      2601 (1.3%)      Yes     Yes
B       A      686 (0.21%)      -       -
B       B      184 (0.092%)     Yes     Yes
AB      AB     211 (0.039%)     Yes     Yes



First we display the result of unsupervised learning in Table 3. We used an unclustered
single-length model since the number of training examples is sufficient to adequately
model normal traffic. The smoothing factor is set to 0.001 and the
anomaly threshold to 256. Dataset A has 331,236 non-zero payload packets, and B has
199,881. The third column shows the number and percentage of packets that are
deemed anomalous. Surprisingly, when we manually checked the anomalous
packets we found Code Red II attacks and extremely long query string buffer overflow
attacks in dataset B. (“Yes” means the attack is successfully detected.)

There is a high anomaly rate when we train on A and test on B; this is because 
there are many pdf file-uploads in B that did not occur in A. (Notice the dates. A was 
captured during the summer; B was captured later during student application time.) 
Because pdf files are encoded with many nonprintable characters, these packets are 
very different from other normal HTTP request packets. For the rest of those detected 
packets, more than 95% are truly anomalous. They include malformed HTTP headers
like “ : ”, a string with all capital letters, “Weferer” replacing the
standard Referer tag (apparently a privacy feature of a COTS product), extremely
long and unusual parameters for “range”, JavaScript embedded in HTML files sent to the CS
server, etc. These requests might do no harm to the server, but they are truly unusual 
and should be filtered or redirected to avoid a possible attack. They do provide 
important information as well to other detectors that may deem their connections 
anomalous for other reasons. 

Figure 7 displays a plot of the distance values of the normal packets against the
attacks. For illustrative purposes, we selected some packets of the Code Red II
attack and the buffer overflow attack that have length 1460 and were detected as
anomalous, and compare their distances with those of the normal packets of the same
length. The training and test data both use data set A for these plots.

Fig. 7. The computed Mahalanobis distance of the normal and attack packets

We also tested some other packets collected from online sources of viruses as they
appeared in the wild and within the DARPA 99 data set. These were tested against
the CUCS dataset. They include Code Red I & II, Nimda, crashiis, back, apache2,
etc. All of these tcpdump packets containing viruses were successfully caught by our
model.

For illustrative purposes, we also display in Table 4 the Z-strings of the Code Red
II and buffer overflow attacks, along with the centroid Z-string, to demonstrate how
different each appears from the norm. Note that these are the Z-strings of single
malicious packets we captured at packet payload length 1460. Because the full
Z-string is too long (more than 200 characters) and contains many nonprintable
characters, we display only the ASCII values, in decimal, of the first 20
characters. The buffer overflow packet has only 4 different characters, so its
Z-string has length 4 and is displayed in full.
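
As a rough illustration of how such a Z-string could be derived, the sketch below assumes, as described in the earlier sections, that a Z-string lists the byte values of a payload ordered from most to least frequent and omits bytes that never occur. Breaking ties by byte value is an assumption made here for determinism, and the example payload is made up.

```python
from collections import Counter


def z_string(payload):
    """Byte values ordered by descending frequency; bytes that never occur are omitted.

    Ties are broken by ascending byte value (an assumption of this sketch).
    """
    counts = Counter(payload)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [byte for byte, _ in ranked]


# Hypothetical 1460-byte payload that uses only four distinct characters.
EXAMPLE_PAYLOAD = b"A" * 1000 + b"%" * 300 + b"0" * 150 + b"D" * 10
print(z_string(EXAMPLE_PAYLOAD))  # [65, 37, 48, 68]
```

Because a payload contains at most 256 distinct byte values, the list returned here can never be longer than 256 entries.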



Table 4. Illustrations of the partial Z-strings. The characters are shown in ASCII

Code Red II (first 20 characters)   88 0 255 117 48 85 116 37 232 100 100 106 69 133 137 80 254 1 56 51
Buffer Overflow (all)               65 37 48 68
Centroid (first 20 characters)      48 73 146 36 32 46 61 113 44 110 59 70 45 56 50 97 110 115 51 53



5 Conclusion 

The experimental results indicate that the method is effective at detecting attacks.
On the 1999 DARPA IDS dataset, the best trained model for TCP traffic detected 57
of 97 attacks, with every port’s false positive rate lower than 1%. For port 80, it
achieved an almost 100% detection rate with a false positive rate of around 0.1%. It
also successfully detected the Code Red II and buffer overflow attacks in the
unlabeled CUCS dataset. The payload model is very simple, state-free, and quick to
compute, in time that is linear in the payload length. It also has the advantage of
being implemented as an incremental, unsupervised learning method. The payload
anomaly detector is intended to be correlated with other detectors to mitigate false
alarms and to increase the coverage of attacks that may be detected. The experiments
also demonstrated that clustering centroids from neighboring length bins
dramatically reduces memory consumption, by up to a factor of 16.

The Z-string derived from the byte distributions can be used as a “signature” to 
characterize payloads. Each such string is at most 256 characters, and can be readily 
stored and communicated rapidly among sites in a real-time distributed detection 
system as “confirmatory” evidence of a zero-day attack. This can be accomplished
faster than is possible by observing large bursts of probing activity across a
large segment of the Internet. This approach may also have great value in detecting
slow and stealthy worm propagations that avoid activity of a bursty nature.

In our future work, we plan to evaluate the technique in live environments, to
implement the Z-string distribution mechanism and measure its cost and speed, and,
most interestingly, to determine whether higher-order n-grams add value in modeling
payload. Furthermore, we plan to evaluate the opportunity for, or difficulty of,
mimicry attacks by comparing the payload distributions across different sites. If,
as we suspect, each site’s payload distributions are consistently different (in a
statistical sense), then the anomaly detection approach proposed here, based upon
site-specific payload models, will provide protection for all sites.

Abstract. Anomaly detection is a promising approach to detecting intruders
masquerading as valid users (called masqueraders). It creates a user profile and 
labels any behavior that deviates from the profile as anomalous. In anomaly de- 
tection, a challenging task is modeling a user’s dynamic behavior based on se- 
quential data collected from computer systems. In this paper, we propose a novel 
method, called Eigen co-occurrence matrix (ECM), that models sequences such 
as UNIX commands and extracts their principal features. We applied the ECM 
method to a masquerade detection experiment with data from Schonlau et al. We 
report the results and compare them with results obtained from several conven- 
tional methods. 

Keywords: Anomaly detection, User behavior, Co-occurrence matrix, PCA, Layered
networks

1 Introduction 

Detecting the presence of an intruder masquerading as a valid user is becoming a critical
issue as security incidents become more common and more serious. Anomaly detection
is a promising approach to detecting such intruders (masqueraders). It first creates a
profile defining a normal user’s behavior. It then measures the similarity of a current
behavior with the created profile and notes any behavior that deviates from the profile.
Various approaches for anomaly detection differ in how they create profiles and how
they define similarity.

In most masquerade detection methods, a profile is created by modeling sequential
data, such as the time of login, physical location of login, duration of user session,
programs executed, names of files accessed, and user commands issued [1]. One of the
challenging tasks in detecting masqueraders is to accurately model user behavior based
on such sequential data. This is challenging because the nature of a user’s behavior is
dynamic and difficult to capture completely. In this paper, we propose a new method,
called Eigen co-occurrence matrix (ECM), designed to model such dynamic user
behavior.

E. Jonsson et al. (Eds.): RAID 2004, LNCS 3224, pp. 223-237, 2004.

One of the approaches to modeling user behavior is to convert a sequence of data
into a feature vector by accumulating measures of either unary events (histogram) or
n-connected events (n-grams) [2-4]. However, the former approach only considers the
number of occurrences of observed events within a sequence, and thus sequential
information is not included in the resulting model. The latter approach considers
n-connected neighboring events within a sequence. Neither of them considers any
correlation between events that are not adjacent to each other.

Other approaches to modeling user behavior are based on converting a sequence
into a network model. Such approaches include those based on an automaton [5-8], a
Bayesian network [9], and a hidden Markov model (HMM) [10, 11].

The nodes and arcs in an automaton can remember short- and long-range transition 
relations between events by constructing rules within a sequence of events. To construct 
an automaton, we thus require well-defined rules that can be transformed to a network. 
However, it is difficult to construct an automaton from a set of user-generated
sequences with various contexts, which do not have such well-defined rules. Even when
an automaton can be obtained, it is computationally expensive to update the automaton
when a new sequence is added.

A node in a Bayesian network associates probabilities of the node being in a spe- 
cific state given the states of its parents. The parent-child relationship between nodes 
in a Bayesian network indicates the direction of causality between the corresponding 
variables. That is, the variable represented by the child node is causally dependent on 
those represented by its parents. The topology of a Bayesian network must be prede- 
fined, however, and thus, the capability for modeling a sequence is dependent on the 
predefined topology. 

An HMM can model a sequence by defining a network model that usually has a 
feed-forward characteristic. The network model is created by learning both the prob- 
ability of each event emerging from each node and the probability of each transition 
between nodes by using a set of observed sequences. However, it is difficult to build an
adequate topology for an HMM by using ad hoc sequences generated by a user. As a
result, the performance of a system based on an HMM varies depending on the topology 
and the parameter settings. 

We argue that the dynamic behavior of a user appearing in a sequence can be captured
by correlating not only connected events but also events that are not adjacent
to each other while appearing within a certain distance (non-connected events). Based
on this assumption, to model user behavior, the ECM method creates a so-called
co-occurrence matrix by correlating an event in a sequence with any following events
that appear within a certain distance. The ECM method then creates so-called Eigen
co-occurrence matrices. The ECM method is inspired by the Eigenface technique, which
is used to recognize human facial images. In the Eigenface technique, the main idea
is to decompose a facial image into a small set of characteristic feature images called
eigenfaces, which may be thought of as the principal components of the original
images. These eigenfaces are the orthogonal vectors of a linear space. A new facial
image is then reconstructed by projecting onto the obtained space. In the ECM method,
we consider the co-occurrence matrix and the Eigen co-occurrence matrices analogous to
a facial image and the corresponding eigenfaces, respectively. The Eigen co-occurrence
matrices are characteristic feature sequences, and the characteristic features of a new
sequence converted to a co-occurrence matrix are obtained by projecting it onto the
space defined by the Eigen co-occurrence matrices.

In addition, the ECM method constructs the extracted features as a layered net- 
work. The distinct principal features of a co-occurrence matrix are presented as layers. 
The layered network enables us to perform detailed analysis of the extracted principal 
features of a sequence. 

In summary, the ECM method has three main components: (1) modeling of the 
dynamic features of a sequence; (2) extraction of the principal features of the resulting 
model; and (3) automatic construction of a layered network from the extracted principal 
features. 

The remainder of the paper is organized as follows. In Section 2, the ECM method
is described in detail by using an example set of UNIX commands. Section 3 applies 
the ECM method to detect anomalous users in a dataset, describes our experimental 
results, and compares them with results obtained from several conventional methods. 
Section 4 analyzes the computational cost involved in the ECM method. Section 5 dis- 
cusses possible detection improvements in using the ECM method. Section 6 gives our 
conclusions and describes our future work. 

2 The Eigen Co-occurrence Matrix (ECM) Method 

The purpose of this study is to distinguish malicious users from normal users. To do so,
we first need to model a sequence of user commands and then apply a pattern
classification method. To accurately classify a sequence as normal or malicious, it is
necessary to extract its significant characteristics (principal features) and, if
necessary, convert the extracted features into a form suitable for detailed analysis.
In this section, we explain how the ECM method models a sequence, how it obtains the
principal features, and how it constructs a model for detailed analysis, namely, a
network model. The overall procedure of the ECM method is illustrated in Figure 1, and
the notation and terminology used in the ECM method are listed in Table 1.

In the following sections, we explain each procedure in the ECM method by using
a simple example of UNIX command sequences. Figure 2 shows an example dataset
of UNIX commands for three users, designated as User1, User2, and User3. Each user
issued ten UNIX commands, which are shown truncated (without their arguments) in
the interest of simplicity.

Fig. 1. Overall procedure of the ECM method

Table 1. Notation and terminology

l      length of an observation sequence
s      maximum distance over which correlations between events are considered (scope size)
O      set of observation events
m      number of events in O
D      set of sample observation sequences (domain dataset)
n      number of sample sequences in D
M      a co-occurrence matrix
V_i    ith Eigen co-occurrence matrix
F      a feature vector
f_i    ith component of F
N      dimension size of F
X_i    a matrix for producing ith positive network layer
Y_i    a matrix for producing ith negative network layer
h      threshold of elements in X_i (or Y_i) to construct a network layer
R      number of elements in f_i V_i for constructing the ith network layer
r      number of nodes in a subnetwork

        time
User1   cd ls less ls less cd ls cd cd ls
User2   emacs gcc gdb emacs ls gcc gdb ls ls emacs
User3   mkdir cp cd ls cp ls cp cp cp cp

Fig. 2. Example dataset of UNIX commands

2.1 Modeling a Sequence 

The ECM method models a sequence by correlating an event with any following events
that appear within a certain distance. The strength of the correlation between two events
is defined by (a) the distance between events and (b) the frequency of their occurrence.
In other words, when the distance between two events is short, or when they appear
more frequently, their correlation becomes stronger. To model such strength of
correlation between events, we construct a so-called co-occurrence matrix by counting
the occurrence of every event pair within a certain distance (scope size). Thus, the
correlations of both connected and non-connected events are captured for every event
pair and subsequently represented in the matrix.

Fig. 3. Correlation between ls and less for User1 (s = 6; strength of correlation: 2 + 1 = 3)

Fig. 4. Co-occurrence matrix of User1

We define M_X as the co-occurrence matrix of a sequence X (= x_1, x_2, x_3, ..., x_l) with
length l. We define the unique events appearing in the sequence as a set of observation
events, denoted as O (= o_1, o_2, o_3, ..., o_m). In the example dataset of Figure 2, O is
cd ls less emacs gcc gdb mkdir cp. The correlation between the ith and jth events
in M_X, o_i and o_j, is computed by counting the number of occurrences of the event
pair within a scope size of s. Here, we did not change the strength of the correlations
between events depending on their distance, but instead used a constant value 1 for
simplicity. Doing this for every event pair generates a matrix representing all of the
respective occurrences. Each element in the matrix represents the perceived strength of
correlation between two events. For example, as illustrated in Figure 3, the events ls
and less are correlated with a strength of three when s and l are defined as 6 and 10,
respectively. Figure 4 shows the matrix generated from the sequence of User1.
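
The counting procedure can be sketched in a few lines. This is only an illustration under one stated assumption: an event at position p is paired with each of the s events that follow it (positions p + 1 through p + s). The text does not spell out whether the scope includes the event itself, so the exact window boundary is a guess; the function and variable names are hypothetical.

```python
def cooccurrence_matrix(sequence, events, scope):
    """Count, for every ordered event pair, how often the second event follows
    the first within `scope` positions, using a constant weight of 1."""
    index = {event: k for k, event in enumerate(events)}
    size = len(events)
    matrix = [[0] * size for _ in range(size)]
    for pos, first in enumerate(sequence):
        # Assumed window: the `scope` events that follow position `pos`.
        for later in sequence[pos + 1 : pos + 1 + scope]:
            matrix[index[first]][index[later]] += 1
    return matrix


EVENTS = ["cd", "ls", "less", "emacs", "gcc", "gdb", "mkdir", "cp"]
USER1 = "cd ls less ls less cd ls cd cd ls".split()

M_USER1 = cooccurrence_matrix(USER1, EVENTS, scope=6)
print(M_USER1[EVENTS.index("ls")][EVENTS.index("less")])  # 3
```

The printed value reproduces the strength of three between ls and less quoted for User1: the first ls is followed by less twice within the scope, and the second ls once.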

2.2 Extracting the Principal Features 

As explained earlier, to distinguish a malicious user from a normal user, it is nec- 
essary to introduce a pattern classification method. Measuring the distance between 
co-occurrence matrices is considered the simplest pattern classification method. A co- 
occurrence matrix is highly dimensional, however, and to make an accurate comparison, 
it is necessary to extract the matrix’s principal features. 

The ECM method uses principal component analysis (PCA) to extract the principal 
features, so-called feature vectors. PCA transforms a number of correlated variables 






Mizuki Oka et al. 



into a smaller number of uncorrelated variables called principal components. It can thus 
reduce the dimensionality of the dataset while retaining most of the original variability 
within the data. The process for obtaining a feature vector is divided into the following 
five steps: 

(Step 1) Take a domain dataset and convert its sequences to co-occurrence matrices:
As a first step (Step 1 in Figure 1), we take a set of sample sequences, which we call
a domain dataset and denote as D, and convert the sequences into corresponding
co-occurrence matrices, M_1, M_2, M_3, ..., M_n, where n is the number of sample
observation sequences and M is an m × m matrix (m: number of observation events). In the
current example, the domain dataset consists of all the three users’ sequences (n = 3),
and M is an 8 × 8 matrix (m = 8).

(Step 2) Subtract the mean: We then take the set of co-occurrence matrices M_1, M_2,
M_3, ..., M_n and compute its mean co-occurrence matrix M_mean (Step 2 in Figure 1). Here
we introduce two different ways to compute M_mean. The first way is to compute it
normally:

    M_{mean} = \frac{1}{n} \sum_{k=1}^{n} M_k.    (1)

The second way is to compute M_mean by taking into account the fact that a
co-occurrence matrix can be sparse. Let m_mean(i, j) be the ith-row jth-column element of
the mean co-occurrence matrix M_mean. We then compute m_mean(i, j) by taking the sum
of all the values in m_1(i, j), m_2(i, j), m_3(i, j), ..., m_n(i, j) and dividing by the
number of those values that are non-zero. In summary,

    m_{mean}(i, j) = \frac{1}{K(i, j)} \sum_{k=1}^{n} m_k(i, j),    (2)



where m_k(i, j) is the ith-row jth-column element of the kth co-occurrence matrix, and
K(i, j) and \delta[x] are defined as

    K(i, j) = \sum_{k=1}^{n} \delta[m_k(i, j)]    (3)

and

    \delta[x] = \begin{cases} 1 & \text{if } x \text{ is not equal to zero} \\ 0 & \text{otherwise,} \end{cases}    (4)

respectively. The mean co-occurrence matrix M_mean is then subtracted from each event
co-occurrence matrix:

    A_k = M_k - M_{mean} \quad \text{for } k = 1, 2, 3, \ldots, n,    (5)

where A_k is the kth co-occurrence matrix with the mean subtracted.

(Step 3) Calculate the covariance matrix: We then construct the covariance matrix as

    P = \sum_{k=1}^{n} \hat{A}_k^{T} \hat{A}_k,    (6)

where \hat{A}_k is created by taking each row in A_k and concatenating its elements into
a single vector (Step 3 in Figure 1). The dimension of \hat{A}_k is 1 × m^2. In the
example dataset, the dimension of \hat{A}_k is 1 × 64.
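
The two ways of computing the mean in Step 2, and the subtraction that follows, can be sketched as below. The function names are hypothetical. Returning 0 where every matrix has a zero at position (i, j) is an assumption of this sketch, since the sparse mean is undefined when K(i, j) = 0.

```python
def plain_mean(matrices):
    """Element-wise mean over all n matrices (the first way)."""
    n = len(matrices)
    size = len(matrices[0])
    return [[sum(M[i][j] for M in matrices) / n for j in range(size)]
            for i in range(size)]


def sparse_mean(matrices):
    """Element-wise sum divided by K(i, j), the number of non-zero values (the second way)."""
    size = len(matrices[0])
    mean = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            values = [M[i][j] for M in matrices]
            k = sum(1 for v in values if v != 0)          # K(i, j)
            mean[i][j] = sum(values) / k if k else 0.0    # assumed 0 when K(i, j) = 0
    return mean


def subtract_mean(matrix, mean):
    """A_k = M_k - M_mean."""
    return [[a - b for a, b in zip(row, mean_row)]
            for row, mean_row in zip(matrix, mean)]
```

For three toy 2 × 2 matrices whose top-left entries are 2, 4 and 0, the plain mean of that entry is 2.0 while the sparse mean is 3.0, because only two of the three values are non-zero.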

The components of P, denoted by p_ij, represent the correlations between two event
pairs q_i and q_j, such as the event pairs (ls less) and (ls cd) in the example dataset. An
event pair q_i (= o_x, o_y) can be obtained by

    x = \gamma[(i - 1)/m] + 1
    y = i - \gamma[(i - 1)/m] \times m,    (7)

where \gamma[z] is the integer part of the value z. The variance of a component indicates
the spread of the component values around its mean value. If two components q_i and q_j
are uncorrelated, their covariance is zero. By definition, the covariance matrix is always
symmetric.
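
The index arithmetic for recovering an event pair from a position in the flattened vector is easy to get wrong by one, so a small sketch may help. It assumes 1-based indices, as in the text, and row-by-row concatenation of the m × m matrix; the helper names are hypothetical.

```python
def flatten(matrix):
    """Concatenate the rows of an m x m matrix into one vector of length m * m."""
    return [value for row in matrix for value in row]


def pair_from_index(i, m):
    """Map the 1-based position i in the flattened vector to the 1-based pair (x, y)."""
    q = (i - 1) // m          # integer part of (i - 1) / m
    return q + 1, i - q * m


def index_from_pair(x, y, m):
    """Inverse mapping: 1-based pair (x, y) to the 1-based position i."""
    return (x - 1) * m + y


EVENTS = ["cd", "ls", "less", "emacs", "gcc", "gdb", "mkdir", "cp"]
x, y = pair_from_index(11, len(EVENTS))
print(EVENTS[x - 1], EVENTS[y - 1])  # ls less
```

With m = 8, position 11 of the flattened vector corresponds to row 2, column 3, which is the pair (ls less) used as an example in the text.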



(Step 4) Calculate the eigenvectors and eigenvalues of the covariance matrix: Since
the covariance matrix P is symmetric (its dimension is m^2 × m^2, or 64 × 64 in the
example dataset), we can calculate an orthogonal basis by finding its eigenvalues and
eigenvectors (Step 4 in Figure 1). The eigenvector with the highest eigenvalue is the
first principal component (the most characteristic feature) since it implies the highest
variance, while the eigenvector with the second highest eigenvalue is the second
principal component (the second most characteristic feature), and so forth. By ranking the
eigenvectors in order of descending eigenvalues, namely (v_1, v_2, ..., v_{m^2}), we can
create an ordered orthogonal basis according to significance. Since the eigenvectors
belong to the same vector space as the co-occurrence matrices, v_i can be converted to an
m × m matrix (8 × 8 in the example dataset). We call such a matrix an Eigen co-occurrence
matrix and denote it as V_i.
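
For readers who want to see Steps 3 and 4 in executable form, the sketch below finds only the first principal component, using power iteration. This is a stand-in chosen to keep the example dependency-free; the paper reports using Matlab for these steps, and a full eigendecomposition routine would normally be used instead. The sketch assumes the covariance matrix is the sum of outer products of the flattened, mean-subtracted vectors, and it never forms that m^2 × m^2 matrix explicitly. All names are hypothetical.

```python
import math


def covariance_times(rows, v):
    """Return P v, where P is the sum over k of a_k^T a_k, without building P."""
    out = [0.0] * len(v)
    for a in rows:
        dot = sum(ai * vi for ai, vi in zip(a, v))
        for idx, ai in enumerate(a):
            out[idx] += ai * dot
    return out


def first_principal_component(rows, iterations=200):
    """Power iteration for the eigenvector of P with the largest eigenvalue.

    The uniform start vector must not be orthogonal to that eigenvector.
    """
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = covariance_times(rows, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, covariance_times(rows, v)))
    return eigenvalue, v
```

For the toy mean-subtracted vectors (3, 0), (-3, 0), (0, 1) and (0, -1), the covariance matrix is diagonal with entries 18 and 2, so the sketch returns an eigenvalue close to 18 and an eigenvector close to (1, 0).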

Instead of using all the eigenvectors, we may represent a co-occurrence matrix by 
choosing N of the eigenvectors. This compresses the original co-occurrence matrix 
and simplifies its representation without losing much information. We define these N 
eigenvectors as the co-occurrence matrix space. Obviously, the larger N is, the higher 
the contribution rate of all the eigenvectors becomes. The contribution rate is defined as 



    \text{contribution rate} = \frac{\sum_{i=1}^{N} \lambda_i}{\sum_{i=1}^{m^2} \lambda_i},    (8)

where \lambda_i denotes the ith largest eigenvalue.
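
A short sketch of how N might be chosen from the eigenvalues follows. It assumes the usual definition of the contribution rate: the sum of the N largest eigenvalues divided by the sum of all eigenvalues. The helper names and the toy eigenvalues are hypothetical.

```python
def contribution_rate(eigenvalues, n_components):
    """Share of the total variance carried by the n_components largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:n_components]) / sum(ordered)


def smallest_n_for(eigenvalues, target):
    """Smallest N whose contribution rate reaches the target (for example 0.9)."""
    for n in range(1, len(eigenvalues) + 1):
        if contribution_rate(eigenvalues, n) >= target:
            return n
    return len(eigenvalues)
```

With toy eigenvalues 4, 3, 2 and 1, two components give a contribution rate of 0.7 and three components are needed to reach 0.9.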

(Step 5) Obtain a feature vector: We can obtain the feature vector of any co-occurrence
matrix, M, by projecting it onto the defined co-occurrence matrix space (Step 5 in Figure
1). The feature vector F of M is obtained by the dot product of the vectors v_i and
\hat{A}, where f_i is defined as

    f_i = v_i^{T} \hat{A} \quad \text{for } i = 1, 2, 3, \ldots, N.    (9)

The components f_1, f_2, f_3, ..., f_N of F are the coordinates within the co-occurrence
matrix space. Each component represents the contribution of each respective Eigen
co-occurrence matrix. Any input sequence can be compressed from m^2 to N dimensions while
maintaining a high level of variance.

Fig. 5. Positive layered network for User1

Fig. 6. Combined positive layered network of User1. The solid lines and dotted lines
correspond to layer 1 and 2, respectively



2.3 Constructing a Layered Network 

Once a feature vector F is obtained from a co-occurrence matrix, the ECM method
converts it to a so-called layered network (shown as construction of layered network in
Figure 1). The ith layer of a network is constructed from the corresponding ith Eigen
co-occurrence matrix V_i multiplied by the ith coordinate f_i of F. In other words, the
ith layer of the network represents the ith principal feature of the original
co-occurrence matrix.

The layered network can be obtained from equation (9). Recall that this equation
for obtaining a component f_i (for i = 1, 2, 3, ..., N) of a feature vector is

    f_i = v_i^{T} \hat{A} \quad \text{for } i = 1, 2, 3, \ldots, N,

where \hat{A} is the vector representation of A = (M - M_mean). By isolating \hat{A} in
equation (9), we can obtain M', an approximation to the original co-occurrence matrix M
with the mean M_mean subtracted. In summary,

    (M - M_{mean}) \approx \sum_{i=1}^{N} f_i V_i = M',    (10)

where f_i V_i can be considered an adjacency matrix labeled by the set of observation
events O. The ith network layer can be constructed by connecting the elements in the
obtained matrix M'.

Fig. 7. Negative layered network for User1

Fig. 8. Combined negative layered network for User1. The solid and dotted lines
correspond to layers 1 and 2, respectively
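
The projection of equation (9) and the approximate reconstruction of equation (10) can be illustrated with plain lists. This is a minimal sketch that assumes the basis vectors are orthonormal and that matrices are flattened row by row; the toy basis and input vector are made up for the example.

```python
def project(basis, a_hat):
    """f_i = v_i . a_hat for each basis vector v_i (equation 9)."""
    return [sum(v * a for v, a in zip(vec, a_hat)) for vec in basis]


def reconstruct(basis, features):
    """Flattened approximation M' as the sum over i of f_i v_i (equation 10)."""
    dim = len(basis[0])
    return [sum(f * vec[d] for f, vec in zip(features, basis)) for d in range(dim)]


def unflatten(vector, m):
    """Turn a vector of length m * m back into an m x m matrix, row by row."""
    return [vector[r * m:(r + 1) * m] for r in range(m)]


# Toy orthonormal basis (N = 2) for flattened 2 x 2 matrices (m^2 = 4).
BASIS = [[0.5, 0.5, 0.5, 0.5], [0.5, -0.5, 0.5, -0.5]]
A_HAT = [3.0, 1.0, 3.0, 1.0]     # hypothetical mean-subtracted, flattened matrix

F = project(BASIS, A_HAT)
M_PRIME = unflatten(reconstruct(BASIS, F), 2)
print(F)        # [4.0, 2.0]
print(M_PRIME)  # [[3.0, 1.0], [3.0, 1.0]]
```

In this toy case the input happens to lie entirely in the span of the two basis vectors, so the reconstruction is exact; in general, keeping only N of the m^2 eigenvectors gives an approximation.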



Each layer of the network constructed by f_i V_i (for i = 1, 2, 3, ..., N) provides the
distinct characteristic patterns observed in the approximated co-occurrence matrix. We
can also express such characteristics in relation to the average co-occurrence matrix by
separating it as

    \sum_{i=1}^{N} f_i V_i = \sum_{i=1}^{N} (X_i + Y_i) = \sum_{i=1}^{N} X_i + \sum_{i=1}^{N} Y_i,    (11)

where X_i (or Y_i) denotes an adjacency matrix whose elements are determined by the
corresponding positive (or negative) elements in f_i V_i. The matrix X_i (or Y_i)
represents the principal characteristic of M' in terms of frequency (or rarity) in
relation to the average co-occurrence matrix. We call the network obtained from X_i (or
Y_i) a positive (or negative) network.

There may be elements in X_i (or Y_i) that are too small to serve as principal
characteristics of M'. Thus, instead of using all the elements of X_i (or Y_i), we set a
threshold h and choose elements that are larger (or smaller) than h (or -h) in order to
construct the ith layer of the positive (or negative) network. Assigning a higher value
to h reduces the number of nodes in the network and consequently creates a network with
a different topology.
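
The separation of one layer into its positive and negative parts, with the threshold h applied, might look as follows. This is a sketch with hypothetical names; entries whose magnitude does not exceed h are simply set to zero in both parts.

```python
def split_layer(f_i, V_i, h=0.0):
    """Split the layer f_i * V_i into a positive part X_i and a negative part Y_i.

    Elements larger than h go to X_i, elements smaller than -h go to Y_i,
    and everything in between is dropped (left at zero).
    """
    size = len(V_i)
    X = [[0.0] * size for _ in range(size)]
    Y = [[0.0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            value = f_i * V_i[r][c]
            if value > h:
                X[r][c] = value
            elif value < -h:
                Y[r][c] = value
    return X, Y
```

With h = 0 the two parts add back up to f_i V_i exactly, which is the decomposition written in the text; raising h removes the weakest entries from both networks.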

Figure 5 shows the first and second layers of the positive networks, obtained for
User1 in the example dataset with h assigned to 0. We can combine these two layers
to describe User1’s overall patterns of principal frequent commands. The combined
network is depicted in Figure 6, which indicates strong relations between the commands
ls, cd, and less. This matches our human perception of the command sequence of
User1 (i.e., cd ls less ls less cd ls cd cd ls).

Similarly, the first and second layers of the negative network and the combined
network obtained for User1 are shown in Figures 7 and 8, respectively. These negative
networks indicate the rarely observed command patterns in the command sequence of
User1 relative to the average observed command patterns. We can observe strong
correlations in the commands gdb, gcc, ls, and emacs. These relations did not appear in
the command sequence.

Fig. 9. Composition of the experimental dataset



3 Application of the ECM Method 

3.1 Overview of the Experimental Data 

We applied the ECM method to a dataset for masquerade detection provided by Schonlau
et al. [12]. The dataset consists of 50 users’ commands entered at a UNIX prompt,
with 15,000 commands recorded for each user. For privacy reasons, the dataset
includes no reporting of flags, aliases, arguments, or shell grammar. The users are
designated as User 1, User 2, and so on. The first 5000 commands are entered by the
legitimate user, and the masquerading commands are inserted in the remaining 10,000
commands. All the user sequences were decomposed into sequences of 100
commands (l = 100). Figure 9 illustrates the composition of the dataset.

3.2 Creation of User Profiles (Offline)

For each user, we created a profile representing his normal behavior. Each decomposed
sequence was converted into a co-occurrence matrix with a scope size of six
(s = 6). We did not change the strength of the correlations between events depending
on their distance but instead used a constant value 1 for simplicity. We took all of
the users’ training data, consisting of 2500 (50 sequences × 50 users) decomposed
sequences, and defined it as the domain dataset (n = 2500). The set of observation
events O (= o_1, o_2, o_3, ..., o_m) was determined by the unique events appearing in the
domain dataset, which accounted for 635 commands (m = 635). We took 50 Eigen
co-occurrence matrices (N = 50), whose contribution rate was approximately 90%, and
defined this as the co-occurrence matrix space.

The profile of a user was created by using his training dataset. We first converted
all of his training sequences to co-occurrence matrices and obtained the corresponding
feature vectors by projecting them onto the defined co-occurrence matrix space. Each
feature vector was then used to reconstruct an approximated original co-occurrence
matrix. This co-occurrence matrix was finally converted into a positive (or negative)
layered network with a threshold of 0 (h = 0). We only used the positive layered network
to define each user’s profile.

3.3 Recognition of Anomalous Sequences (Online) 

When a sequence seq_i of User u was to be tested, we followed this procedure:

1. Construct a co-occurrence matrix from seq_i.

2. Project the obtained co-occurrence matrix on the co-occurrence matrix space and
obtain its feature vector.

3. Multiply the feature vector by the Eigen co-occurrence matrices to obtain a layered
network.

4. Compare the layered network with the profile of User u.

5. Classify the testing sequence as anomalous or normal based on a threshold.

Fig. 10. ROC curves for the ECM method

To classify a testing sequence seq_i as anomalous or normal, we computed the
similarity between each network layer of seq_i and each network layer in the user profile,
where we chose the largest value as the similarity. If the computed similarity of seq_i
was under a threshold \varepsilon_u for User u, then the testing sequence was classified
as anomalous; otherwise, it was classified as normal. We defined the similarity between
the networks of two sequences, seq_i and seq_j, as

    Sim(seq_i, seq_j) = \sum_{k=1}^{N} \Gamma(T_k(i), T_k(j)),    (12)

where T_k(i) is the obtained network at the kth layer of seq_i and \Gamma(T_k(i), T_k(j))
is the number of subnetworks that T_k(i) and T_k(j) have in common. We extracted the 30
largest values to form a network (R = 30) and employed 3 connected nodes as the
unit of a subnetwork (r = 3).
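
One possible reading of this similarity measure is sketched below. Two points are assumptions rather than details given in the text: an edge of a layer is taken to be one of its R largest positive entries, and a subnetwork of three connected nodes is taken to be a directed chain a -> b -> c formed by two edges. The function names are hypothetical.

```python
def top_edges(layer, events, R=30):
    """Directed edges for the R largest positive entries of one network layer."""
    cells = [(layer[i][j], events[i], events[j])
             for i in range(len(events)) for j in range(len(events))
             if layer[i][j] > 0]
    cells.sort(key=lambda cell: -cell[0])
    return {(src, dst) for _, src, dst in cells[:R]}


def subnetworks(edges):
    """All chains a -> b -> c of three connected nodes (r = 3, assumed shape)."""
    return {(a, b, d) for (a, b) in edges for (c, d) in edges if b == c}


def similarity(layers_i, layers_j, events, R=30):
    """Sum over layers of the number of subnetworks the two networks share."""
    return sum(
        len(subnetworks(top_edges(Li, events, R)) & subnetworks(top_edges(Lj, events, R)))
        for Li, Lj in zip(layers_i, layers_j)
    )
```

For example, a single layer with edges cd -> ls, ls -> less and less -> cd contains three chains, while a layer with only cd -> ls and ls -> less contains one; the two layers share that one chain, so their similarity is 1.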

3.4 Results 

The results illustrate the trade-off between correct detection (true positives) and false
detection (false positives). A receiver operating characteristic curve (ROC curve) is
often used to represent this trade-off. The percentages of true positives and false
positives are shown on the y-axis and x-axis of the ROC curve, respectively. Any increase
in the true positive rate will be accompanied by an increase in the false positive rate.
The closer the curve follows the left-hand border and then the top border of the ROC
space, the more accurate the results are, since they indicate high true positive rates
and, correspondingly, low false positive rates.

Fig. 11. ROC curve for the ECM method with the best results from other methods shown for
comparison (x-axis: % False Detection Rate)

Figure 10 shows the resulting ROC curve obtained from our experiment with the
ECM method. We have plotted different correct detection rates and false detection rates
by changing \alpha in the expression

    \varepsilon_u^{opt} + \alpha,

where \varepsilon_u^{opt} is the optimal threshold for User u. The optimal threshold
\varepsilon_u^{opt} is defined by finding the largest correct detection rate with a false
detection rate of less than \beta%. We set \beta to 20 in this experiment and used the
same values of \varepsilon_u^{opt} throughout all the test sequences (no updating). As a
result, the ECM method achieved a 72.3% correct detection rate with a 2.5% false
detection rate.
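
The threshold selection described above can be sketched as a simple search. The text does not say which scores were used to find the optimal threshold, so the sketch assumes two lists of similarity scores for one user, one from legitimate sequences and one from masquerade sequences; all names and numbers are hypothetical.

```python
def rates(threshold, normal_scores, masquerade_scores):
    """Correct and false detection rates when scores below the threshold are flagged."""
    detected = sum(s < threshold for s in masquerade_scores) / len(masquerade_scores)
    false_alarms = sum(s < threshold for s in normal_scores) / len(normal_scores)
    return detected, false_alarms


def optimal_threshold(normal_scores, masquerade_scores, beta=0.20):
    """Threshold with the largest correct detection rate whose false detection
    rate stays below beta; returns (threshold, detected, false_alarms) or None."""
    candidates = sorted(set(normal_scores) | set(masquerade_scores))
    candidates.append(candidates[-1] + 1)   # also try flagging every sequence
    best = None
    for t in candidates:
        detected, false_alarms = rates(t, normal_scores, masquerade_scores)
        if false_alarms < beta and (best is None or detected > best[1]):
            best = (t, detected, false_alarms)
    return best
```

Points on an ROC curve can then be obtained by evaluating `rates(best[0] + alpha, ...)` for a range of offsets alpha around the selected threshold.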

Schonlau et al. [12] and Maxion et al. [13] have applied a number of masquerade
detection techniques, including Bayes 1-Step Markov, Hybrid Multi-Step Markov, IPAM,
Uniqueness, Sequence-Match, Compression, and Naive Bayes, to the same dataset used
in this study. (See refs. [12] and [13] for detailed explanations of each technique.) Their
results are shown in Figure 11 along with our results from the ECM method. As can be
seen from the data, the ECM method achieved one of the best scores among the
various approaches.

4 Computational Cost 

The ECM method has two computational phases, the offline and online phases. For the
offline phase, the required computation processes are the following: transforming a set
of training sequences of length l to co-occurrence matrices, calculating the N
eigenvectors of the covariance matrix, projecting co-occurrence matrices onto the
co-occurrence matrix space to obtain feature vectors, constructing layered networks with R
nodes in each layer, and generating a lookup table containing subnetworks with r connected
nodes.

Table 2. Changeable parameters in obtaining a feature vector

O      set of observation events
l      length of sequence to be tested
s      scope size
D      domain dataset

Table 3. Changeable parameters in obtaining a layer of network

h      threshold of elements in X_i (or Y_i) for constructing a network
R      number of elements in f_i V_i for constructing the ith network layer
r      number of nodes in a subnetwork

We used the Linux operating system (RedHat 9.0) for our experiments. We implemented 
the conversion of a sequence to a co-occurrence matrix in Java SDK 1.4.2 
[14] and the remaining processes in Matlab Release 13 [15]. The hardware platform 
was a Dell Precision Workstation 650 (Intel(R) Xeon(TM) CPU 3.20GHz, 4GB main 
memory, 120GB hard disk). With this environment, for the offline phase, it took 26.77 
minutes to convert all the user training sequences (l = 100, s = 6) to the co-occurrence 
matrices (average of 642 ms each), 23.60 minutes to compute the eigenvectors (N = 50), 
6.76 minutes to obtain all the feature vectors (average of 162 ms each), 677.1 minutes 
to construct all the layered networks with 30 nodes in each layer (average of 16.25 s for 
each feature vector), and 106.5 minutes to construct the lookup table (r = 3). 

For the online phase, the required computations are the following: transforming a 
sequence to a co-occurrence matrix, projecting the obtained co-occurrence matrix to the 
set of N Eigen co-occurrence matrices, obtaining the feature vector of the co-occurrence 
matrix, constructing a layered network with R nodes, generating subnetworks with r 
connected nodes, and comparing the obtained layered network with the corresponding 
user profile. For one testing sequence, using the same environment described above, it 
took 642 ms to convert the sequence (l = 100, s = 6) to the co-occurrence matrix, 162 
ms to obtain the feature vector (N = 50), 16.25 s to construct the layered network (R = 
30), 2.60 s to generate the subnetworks (r = 3), and 2.48 s to compare the subnetworks 
with the profile. In total, it took 22.15 s to classify a testing sequence as normal or 
anomalous. 

5 Discussion 

As noted above, we have achieved better results than the conventional approaches by 
using the ECM method. Modeling a user’s behavior is not a simple task, however, and 
we did not achieve very high accuracy with false positive rates near to zero. There is 
room to improve the performance by varying the parameters of the ECM method, as 
shown in Tables 2 and 3. 
Table 2 lists the parameters that can be changed when computing a feature vector 
from a co-occurrence matrix. The parameter O determines the events for which correlations 
with other events are considered. If we took a larger number of events (i.e., UNIX 
commands), the accuracy of the results would improve but the computational 
cost would increase. Thus, the number of events represents a trade-off between accuracy 
and computational cost. 

Changing the parameter l results in a different length of test sequence. Although we 
set l to 100 in our experiment in order to compare the results with those of conventional 
methods, it could be changed by using a time stamp, for example. The parameter s 
determines the distance over which correlations between events are considered. If we 
assigned a larger value to s, two events separated by a longer time interval could be 
correlated. In our experiment, we did not consider the time in determining the values 
of l and s, but instead utilized our heuristic approach, as the time was not included 
in the dataset. Moreover, for simplicity, we did not change the strength of the correlations 
between events depending on their distance. Dividing the number of occurrences by the 
distance between events, for example, would influence the results. 

Choosing more sequences for the domain dataset D would result in the extraction of 
more precise features from each sequence, as in the case of the Eigenface technique. 
This aspect could be used to update the profile of each user: updating the domain dataset 
would automatically update its extracted principal features, since they are obtained by 
using Eigen co-occurrence matrices. 

Table 3 lists the parameters that can be changed in constructing a network layer 
from a co-occurrence matrix. In our experiment, we set h = 0 and chose the largest 30 
elements (R = 30) to construct a positive network. Nevertheless, the optimal values of 
these parameters are open for discussion. 

Additionally, the detection accuracy would be increased by computing the mean 
co-occurrence matrix $M_0$ by using equation (2) instead of equation (1), since each original 
co-occurrence matrix is sparse. Moreover, normalization of $r(T_k(i), T_k(j))$ by the 
number of arcs (or nodes) in both $T_k(i)$ and $T_k(j)$ may improve the accuracy: let $|T_k(i)|$ 
be the number of arcs (or nodes) in network $T_k(i)$. Then the normalized $r(T_k(i), T_k(j))$ 
would be simply obtained by $r(T_k(i), T_k(j)) / (|T_k(i)|\,|T_k(j)|)$. 



6 Conclusions and Future Work 

Modeling user behavior is a challenging task, as it changes dynamically over time and 
a user’s complete behavior is difficult to define. We have proposed the ECM method to 
accurately model such user behavior. The ECM method is innovative in three aspects. 
First, it models the dynamic natures of users embedded in their event sequences. Second, 
it can discover principal patterns of statistical dominance. Finally, it can represent 
such discovered patterns via layered networks, with not only frequent (positive) proper- 
ties but also rare (negative) properties, where each layer represents a distinct principal 
pattern. 

Experiments on masquerade detection by using UNIX commands showed that the 
ECM method achieved better results, with a higher correct detection rate and a lower 
false detection rate, than the results obtained with conventional approaches. This supports 
our assumption that not only connected events but also non-connected events 
within a certain scope size are correlated in a command sequence. It also shows that 
the principal features from the obtained model of a user's behavior are successfully extracted 
by using PCA, and that detailed analysis by using layered networks can provide 
sufficient, useful features for classification. 

Although we used the layered networks to classify test sequences as normal or ma- 
licious in our experiment, we should also investigate classification by using only the 
feature vectors. Furthermore, we need to conduct more experiments by varying the 
method’s parameters, as described in Section 5, in order to improve the accuracy for 
masquerade detection. We must also try using various matching network algorithms to 
increase the accuracy. 
Abstract. This paper proposes a new approach to detecting aggregated anomalous 
events by correlating host file system changes across space and time. Our 
approach is based on a key observation that many host state transitions of in- 
terest have both temporal and spatial locality. Abnormal state changes, which 
may be hard to detect in isolation, become apparent when they are correlated 
with similar changes on other hosts. Based on this intuition, we have developed a 
method to detect similar, coincident changes to the patterns of file updates that are 
shared across multiple hosts. We have implemented this approach in a prototype 
system called Seurat and demonstrated its effectiveness using a combination of 
real workstation cluster traces, simulated attacks, and a manually launched Linux 
worm. 

Keywords: Anomaly detection, Pointillism, Correlation, File updates, Clustering 



1 Introduction 

Correlation is a recognized technique for improving the effectiveness of intrusion detection 
by combining information from multiple sources. For example, many existing 
works have proposed correlating different types of logs gathered from distributed mea- 
surement points on a network (e.g., [1-3]). By leveraging collective information from 
different local detection systems, they are able to detect more attacks with fewer false 
positives. 

In this paper, we propose a new approach to anomaly detection based on the idea of 
correlating host state transitions such as file system updates. The idea is to correlate host 
state transitions across both space (multiple hosts) and time (the past and the present), 
detecting similar coincident changes to the patterns of host state updates that are shared 
across multiple hosts. Examples of such coincident events include administrative up- 
dates that modify files that have not been modified before, and malware propagations 
that cause certain log files, which are modified daily, to cease being updated. 

Our approach is based on the key observation that changes in host state in a network 
system often have both temporal and spatial locality. Both administrative updates and 
malware propagation exhibit spatial locality, in the sense that similar updates tend to 
occur across many of the hosts in a network. They also exhibit temporal locality in the 
sense that these updates tend to be clustered closely in time. Our goal is to identify 
such atypical aggregate updates, or the lack of typical ones. 

E. Jonsson et al. (Eds.): RAID 2004, LNCS 3224, pp. 238-257, 2004. 

© Springer-Verlag Berlin Heidelberg 2004 
Fig. 1. Pointillist approach to anomaly detection: Normal points are clustered by the dashed circle. 
The appearance of a new cluster consisting of three points suggests anomalous events on hosts A, 
B, and D. 



By exploring both the temporal and spatial locality of host state changes in a net- 
work system, our approach identifies anomalies without foreknowledge of normal 
changes and without system-specific knowledge. Existing approaches focus on the tem- 
poral locality of host state transitions, while overlooking the spatial locality among dif- 
ferent hosts in a network system. They either define a model of normal host state change 
patterns through learning, or specify detailed rules about normal changes. Learning-based 
approaches train the system to learn characteristics of normal changes. Since they 
focus only on the temporal locality of single-host state transitions, any significant deviation 
from the normal model is suspicious and should raise an alarm, resulting in a high 
false positive rate. Rule-based approaches such as Tripwire [4] require accurate, specific 
knowledge of system configurations and daily user activity patterns on a specific host. 
Violation of rules then suggests malicious intrusions. Although rule-based anomaly de- 
tection raises fewer false alarms, it requires system administrators to manually specify 
a set of rules for each host. The correlation capability of our approach across both space 
and time allows us to learn the patterns of normal state changes over time, and to detect 
those anomalous events correlated among multiple hosts due to malicious intrusions. 
This obviates the need for specific rules while eliminating the false alarms caused by 
single host activity pattern shifts. 

The correlation is performed by clustering points, each representing an individual 
host state transition, in a multi-dimensional feature space. Each feature indicates the 
change of a file attribute, with all features together describing the host state transitions 
of an individual machine during a given period (e.g., one day). Over time, the abstrac- 
tion of point patterns inherently reflects the aggregated host activities. For normal host 
state changes, the points should follow some regular pattern by roughly falling into sev- 
eral clusters. Abnormal changes, which are hard to detect by monitoring that host alone, 
will stand out when they are correlated with other normal host state changes. Hence our 
approach shares some flavor of pointillism - a style of painting that applies small dots 
onto a surface so that from a distance the dots blend together into meaningful patterns. 

Figure 1 illustrates the pointillist approach to anomaly detection. There are five 
hosts in the network system. We represent state changes on each host daily as a point 
in a 2-dimensional space in this example. On normal days, the points roughly fall into 
the dash-circled region. The appearance of a new cluster consisting of three points (indicated 
by the solid circle) suggests the incidence of an anomaly on hosts A, B, and D, 
which may all have been compromised by the same attack. Furthermore, if we know 
that certain hosts (e.g., host A) are already compromised (possibly detected by other 
means such as a network-based IDS), then we can correlate the state changes of the 
compromised hosts with the state changes of all other hosts in the network system to 
detect more infected hosts (e.g., hosts B and D). 

We have implemented a prototype system, called Seurat*, that uses file system updates 
to represent host state changes for anomaly detection. Seurat successfully detects 
the propagation of a manually launched Linux worm on a number of hosts in an isolated 
cluster. Seurat has a low false alarm rate when evaluated in a real deployment. These 
alarms are caused by either system re-configurations or network-wide experiments. The 
false negative rate and detection latency, evaluated with simulated attacks, are both low 
for fast propagating attacks. For slowly propagating attacks, there is a tradeoff between 
false negative rate and detection latency. For each alarm, Seurat identifies the list of 
hosts involved and the related files, which we expect will be extremely helpful for system 
administrators to examine the root cause and dismiss false alarms. 

The rest of the paper is organized as follows: Section 2 describes Seurat's threat 
model. Section 3 introduces the algorithm for correlating host state changes across both 
space and time. Section 4 evaluates our approach. Section 5 discusses the limitations of 
Seurat and suggests possible improvements. Section 6 presents related work. 

2 Attack Model 

The goal of Seurat is to automatically identify anomalous events by correlating the state 
change events of all hosts in a network system. Hence Seurat defines an anomalous 
event as an unexpected state change close in time across multiple hosts in a network 
system. 

We focus on rapidly propagating Internet worms, viruses, zombies, or other malicious 
attacks that compromise multiple hosts in a network system within a short time (e.g., one or two 
days). We have observed that, once fast, automated attacks are launched, most of the 
vulnerable hosts get compromised due to the rapid propagation of the attack and the 
scanning preferences of the automated attack tools. According to CERT’s analysis [5], 
the level of automation in attack tools continues to increase, making it faster to search for 
vulnerable hosts and propagate attacks. Recently, the Slammer worm hit 90 percent of 
vulnerable systems in the Internet within 10 minutes [6]. Worse, the lack of diversity in 
systems and software run by Internet-attached hosts enables massive and fast attacks. 
Computer clusters tend to be configured with the same operating systems and software. 
In such systems, host state changes due to attacks have strong temporal and spatial 
locality that can be exploited by Seurat. 

Although Seurat is originally designed to detect system changes due to fast propa- 
gating attacks, it can be generalized to detect slowly propagating attacks as well. This 
can be done by varying the time resolution of reporting and correlating the collective 
host state changes. We will discuss this issue further in Section 5. However, Seurat’s 

* Seurat is the 19th century founder of pointillism. 
global correlation cannot detect abnormal state changes that are unique to only a single 
host in the network system. 

Seurat represents host state changes using file system updates. Pennington et al. [7] 
found that 83% of the intrusion tools and network worms they surveyed modify one 
or more system files. These modifications would be noticed by monitoring file system 
updates. There are many security tools such as Tripwire [4] and AIDE [8] that rely on 
monitoring abnormal file system updates for intrusion detection. 

We use the file name, including its complete path, to identify a file in the network 
system. We regard different instances of a file that correspond to a common path name 
as the same file across different hosts, since we are mostly interested in system files which 
tend to have canonical path names exploited by malicious attacks. We treat files with 
different path names on different hosts as different files, even when they are identical in 
content. 

For the detection of anomalies caused by attacks, we have found that this repre- 
sentation of host state changes is effective and useful. However, we may need different 
approaches for other applications of Seurat such as file sharing detection, or for the de- 
tection of more sophisticated future attacks that alter files at arbitrary locations as they 
propagate. As ongoing work, we are investigating the use of file content digests instead 
of file names. 



3 Correlation-Based Anomaly Detection 

We define a d-dimensional feature vector $H_{ij} = \langle v_1, v_2, \ldots, v_d \rangle$ to represent the file 
system update attributes for host i during time period j. Each $H_{ij}$ can be plotted as 
a point in a d-dimensional feature space. Our pointillist approach is based on correlating 
the feature vectors by clustering. Over time, for normal file updates, the points 
follow some regular pattern (e.g., roughly fall into several clusters). From time to time, 
Seurat compares the newly generated points against points from previous time periods. 
The appearance of a new cluster, consisting only of newly generated points, indicates 
abnormal file updates and Seurat raises an alarm. 

For clustering to work most effectively, we need to find the most relevant features 
(dimensions) in a feature vector given all the file update attributes collected by Seurat. 
We have investigated two methods to reduce the feature vector dimensions: (1) wavelet- 
based selection, and (2) principal component analysis (PCA). 

In the rest of this section, we first present how we define the feature vector space 
and the distances among points. We then describe the methods Seurat uses to reduce 
feature vector dimensions. Finally, we discuss how Seurat detects abnormal file updates 
by clustering. 

3.1 Feature Vector Space 

Seurat uses binary feature vectors to represent host file updates. Each dimension in the 
feature vector space corresponds to a unique file (indexed by the full-path file name). As 
such, the dimension of the space d is the number of file names present on any machine 
in the network system. We define the detection window to be the period in which we are 
interested in finding anomalies. In the current prototype, the detection window is one 
day. For each vector $H_{ij} = \langle v_1, v_2, \ldots, v_d \rangle$, we set $v_k$ to 1 if host i has updated (added, 
modified, or removed) the k-th file on day j; otherwise, we set $v_k$ to 0. 

The vectors generated in the detection window will be correlated with vectors gen- 
erated on multiple previous days. We treat each feature vector as an independent point 
in a set. The set can include vectors generated by the same host on multiple days, or 
vectors generated by multiple hosts on the same day. In the rest of the paper, we use 
$V = \langle v_1, v_2, \ldots, v_d \rangle$ to denote a feature vector for convenience. Figure 2 shows how 
we represent the host file updates using feature vectors. 
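
As an illustration of this representation, the following sketch builds one binary vector per (host, day) pair. The function name, the input layout, and the example paths are hypothetical. For brevity the dimensions are taken from the files that appear in the updates, whereas the text defines them over all file names present on any machine.

```python
from typing import Dict, List, Set, Tuple

HostDay = Tuple[str, int]


def build_feature_vectors(
    updates: Dict[HostDay, Set[str]],
) -> Tuple[List[str], Dict[HostDay, List[int]]]:
    """Map each (host, day) to a binary vector indexed by full-path file name."""
    # One dimension per distinct file name, in a fixed (sorted) order.
    files = sorted({path for paths in updates.values() for path in paths})
    index = {path: k for k, path in enumerate(files)}
    vectors = {}
    for key, paths in updates.items():
        v = [0] * len(files)
        for path in paths:
            v[index[path]] = 1  # file was added, modified, or removed
        vectors[key] = v
    return files, vectors


# Hypothetical update reports: (host, day) -> files updated on that day.
updates = {
    ("A", 1): {"/etc/passwd", "/var/log/messages"},
    ("B", 1): {"/var/log/messages"},
    ("A", 2): {"/var/log/messages", "/usr/bin/ls"},
}
files, vectors = build_feature_vectors(updates)
```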
Fig. 2. Representing host file updates as feature vectors: F1, F2, F3, F4, F5 are five different files 
(i.e., file names). Accordingly, the feature vector space has 5 dimensions in the example. 



The correlation is based on the distances among vectors. Seurat uses a cosine distance 
metric, which is a common similarity measure between binary vectors [9, 10]. 
We define the distance $D(V_1, V_2)$ between two vectors $V_1$ and $V_2$ as their angle $\theta$ 
computed by the cosine value: 

$$D(V_1, V_2) = \theta = \cos^{-1}\left(\frac{V_1 \cdot V_2}{|V_1|\,|V_2|}\right)$$ 
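
A minimal sketch of this angular distance follows. The function name is hypothetical, and the handling of an all-zero vector (a host with no updates) is an assumption, since the text does not cover that case.

```python
import math
from typing import Sequence


def cosine_distance(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Angle (in radians) between two feature vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: treat an all-zero vector as orthogonal to everything.
        return math.pi / 2
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.acos(cos_theta)


same = cosine_distance([1, 1, 0], [1, 1, 0])      # close to 0
disjoint = cosine_distance([1, 0, 0], [0, 1, 1])  # pi / 2
```

For binary vectors the angle lies between 0 (identical update sets) and pi/2 (no file in common).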




Fig. 3. Detection window, comparison window, and correlation window. The detection window is 
day j. The comparison window is from day j - t to day j - 1. The correlation window is from 
day j - t to day j. 



For each day j (the detection window), Seurat correlates the newly generated vectors 
with vectors generated in a number of previous days j - 1, j - 2, .... We assume 
that the same abnormal file update events on day j, if any, have not occurred on those 
previous days. We define the comparison window of day j as the days that we look back 
for comparison, and the correlation window of day j as the inclusive period of day j 
and its comparison window. Vectors generated outside the correlation window of day j 
are not used to identify abnormal file updates on day j. Figure 3 illustrates the concepts 
of detection window, comparison window, and correlation window. 

Since each vector generated during the comparison window serves as an example of 
normal file updates to compare against in the clustering process, we explore the temporal 
locality of normal update events by choosing an appropriate comparison window for 
each day. The comparison window size is a configurable parameter of Seurat. It reflects 
how far we look back into history to implicitly define the model of normal file updates. 
For example, some files such as /var/spool/anacron/cron.weekly on 
Linux platforms are updated weekly. In order to regard such weekly updates as normal 
updates, administrators have to choose a comparison window size larger than a week. 
Similarly, the size of the detection window reflects the degree of temporal locality of 
abnormal update events. 

Since Seurat correlates file updates across multiple hosts, we are interested in only 
those files that have been updated by at least two different hosts. Files that have been updated 
by only one single host in the network system throughout the correlation window 
are more likely to be user files. As such, we do not select them as relevant dimensions 
to define the feature vector space. 
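
The windowing and the two-host filter can be sketched as follows. The function name, the parameter `t` for the comparison window size in days, and the input layout are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

HostDay = Tuple[str, int]


def shared_files(updates: Dict[HostDay, Set[str]], j: int, t: int) -> List[str]:
    """Files updated by at least two different hosts within the
    correlation window of day j, i.e. days j - t through j inclusive."""
    hosts_per_file = defaultdict(set)
    for (host, day), paths in updates.items():
        if j - t <= day <= j:  # ignore reports outside the correlation window
            for path in paths:
                hosts_per_file[path].add(host)
    return sorted(p for p, hosts in hosts_per_file.items() if len(hosts) >= 2)


# Hypothetical reports: (host, day) -> files updated on that day.
updates = {
    ("A", 8): {"/var/log/messages", "/home-like/private"},
    ("B", 9): {"/var/log/messages"},
    ("B", 1): {"/etc/old.conf"},
    ("C", 9): {"/etc/old.conf"},
}
dims = shared_files(updates, j=9, t=7)
```

Here `/etc/old.conf` is dropped because only one of its two updates falls inside the window.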

3.2 Feature Selection 

Most file updates are irrelevant to anomalous events even after we filter out the file 
updates reported by a single host. Those files become noise dimensions when we correlate 
the vectors (points) to identify abnormal updates, and increase the complexity of 
the correlation process. We need more selective ways to choose relevant files and reduce 
feature vector dimensions. Seurat uses a wavelet-based selection method and principal 
component analysis (PCA) for this purpose. 

Wavelet-Based Selection. The wavelet-based selection method regards each individual 
file update status as a discrete time series signal S. Given a file i, the value of the signal 




Fig. 4. Representing file update status with wavelet transformation: The original signal is S, 
which can be decomposed into a low frequency signal cA reflecting the long term update trend, 
and a high frequency signal cD reflecting the daily variations from the long-term trend. 
on day n, denoted by $S_i(n)$, is defined as the total number of hosts that update file i 
on day n in the network system. Each such signal $S_i$ can be decomposed into a low 
frequency signal $cA_i$ reflecting the long term update trend, and a high frequency signal 
$cD_i$ reflecting the day-to-day variation from the long term trend (see Figure 4). If the 
high frequency signal $cD_i$ shows a spike on a certain day, we know that a significantly 
larger number of hosts updated file i than on a normal day. We then select file i as a 
relevant feature dimension in defining the feature vector space. 

Seurat detects signal spikes using the residual signal of the long-term trend. The 
same technique has been used to detect disease outbreaks [11]. To detect anomalies on 
day j, the algorithm takes as input the list of files that have been updated by at least two 
different hosts in the correlation window of day j. Then, from these files the algorithm 
selects a subset that will be used to define the feature vector space. 



For each file i: 

1. Construct a time series signal: 

   S_i = cA_i + cD_i 

2. Compute the residual signal value of day j: 

   R_i(j) = S_i(j) - cA_i(j - 1) 

3. If R_i(j) > alpha, then select file i as a feature dimension 



Fig. 5. Wavelet-based feature selection. 
Fig. 6. Wavelet transformation of file update status: (a) The original signal of the file update status 
(b) The residual signal after wavelet transformation. 



Figure 5 shows the steps to select features by the wavelet-based method. Given a fixed 
correlation window of day j, the algorithm starts with constructing a time series signal 
$S_i$ for each file i, and decomposes $S_i$ into $cA_i$ and $cD_i$ using a single-level wavelet 
transformation as described. Then we compute the residual signal value $R_i(j)$ of day j 
by subtracting the trend value $cA_i(j - 1)$ of day j - 1 from the original signal value 
$S_i(j)$ of day j. If $R_i(j)$ exceeds a pre-set threshold $\alpha$, then the actual number of hosts 
that have updated file i on day j is significantly larger than the prediction $cA_i(j - 1)$ 
based on the long term trend. Therefore, Seurat selects file i as an interesting feature 
dimension for anomaly detection on day j. As an example, Figure 6 shows the original 
signal and the residual signal of a file using a 32-day correlation window in a 22-host 
teaching cluster. Note the threshold value $\alpha$ of each file is a parameter selected based 
on the statistical distribution of historical residual values. 
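
The selection steps can be sketched in a few lines. Several choices here are assumptions rather than details from the text: the Haar wavelet (the text does not name the wavelet family), computing the trend only from days before j, a single fixed threshold `alpha` for all files, and the example signals.

```python
from typing import Dict, List, Sequence


def haar_trend(signal: Sequence[float]) -> List[float]:
    """Low-frequency part of a single-level Haar decomposition, expanded
    back to the signal's length: each pair of days shares its mean."""
    trend: List[float] = []
    for k in range(0, len(signal) - 1, 2):
        mean = (signal[k] + signal[k + 1]) / 2.0
        trend.extend([mean, mean])
    if len(signal) % 2 == 1:
        trend.append(float(signal[-1]))  # unpaired last day keeps its value
    return trend


def residual(signal: Sequence[float], j: int) -> float:
    """R(j) = S(j) - cA(j - 1), with the trend taken from days 0 .. j - 1."""
    return signal[j] - haar_trend(signal[:j])[j - 1]


def select_features(signals: Dict[str, Sequence[float]], j: int,
                    alpha: float) -> List[str]:
    """Files whose residual on day j exceeds the threshold alpha."""
    return sorted(f for f, s in signals.items() if residual(s, j) > alpha)


# Hypothetical signals: number of hosts updating each file per day.
signals = {
    "/bin/ls": [0, 0, 0, 0, 0, 0, 0, 9],
    "/var/log/messages": [20, 21, 20, 22, 21, 20, 21, 21],
}
selected = select_features(signals, j=7, alpha=5)
```

The log file is updated by many hosts every day, so its residual stays near zero and it is not selected, while the sudden burst of updates to `/bin/ls` is.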



PCA-Based Dimension Reduction. PCA is a statistical method to reduce data di- 
mensionality without much loss of information [12]. Given a set of d-dimensional data 
points, PCA finds a set of d' orthogonal vectors, called principal components, that ac- 
count for the variance of the input data as much as possible. Dimensionality reduction 
is achieved by projecting the original d-dimensional data onto the subspace spanned by 
these d' orthogonal vectors. Most of the intrinsic information of the d-dimensional data 
is preserved in the d' -dimensional subspace. 

We note that the updates of different files are usually correlated. For example, when 
a software package is updated on a host, many of the related files will be modified 
together. Thus we can perform PCA to identify the correlation of file updates. 

Given a d-dimensional feature space $Z_2^d$, and a list of m feature vectors $V_1, V_2, \ldots, 
V_m \in Z_2^d$, we perform the following steps using PCA to obtain a new list of feature 
vectors $V'_1, V'_2, \ldots, V'_m \in Z_2^{d'}$ ($d' < d$) with reduced number of dimensions: 

1. Standardize each feature vector $V_k = \langle v_{1k}, v_{2k}, \ldots, v_{dk} \rangle$ ($1 \le k \le m$) by subtracting 
from each of its elements $v_{ik}$ the mean value $u_i$ of the corresponding dimension 
($1 \le i \le d$). We use $\bar{V}_k = \langle \bar{v}_{1k}, \bar{v}_{2k}, \ldots, \bar{v}_{dk} \rangle \in Z_2^d$ to denote the standardized 
vector for the original feature vector $V_k$. Then, 

$$\bar{v}_{ik} = v_{ik} - u_i \quad \left(\text{where } u_i = \frac{\sum_{j=1}^{m} v_{ij}}{m},\ 1 \le i \le d\right)$$ 

2. Use the standardized feature vectors $\bar{V}_1, \bar{V}_2, \ldots, \bar{V}_m$ as input data to PCA in 
order to identify a set of principal components that are orthogonal vectors defining 
a set of transformed dimensions of the original feature space $Z_2^d$. Select the first d' 
principal components that account for most of the input data variances (e.g., 90% of 
data variances) to define a subspace $Z_2^{d'}$. 

3. Project each standardized feature vector $\bar{V}_k \in Z_2^d$ onto the PCA selected subspace 
$Z_2^{d'}$ to obtain the corresponding reduced dimension vector $V'_k \in Z_2^{d'}$. 

Note that PCA is complementary to wavelet-based selection. Once we fix the corre- 
lation window of a particular day, we first pick a set of files to define the feature vector 
space by wavelet-based selection. We then perform PCA to reduce the data dimension- 
ality further. 
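
The three PCA steps can be sketched with NumPy, which is an assumption of this example (the text does not say how PCA was computed). The function name, the use of singular value decomposition, and the toy data are illustrative only.

```python
import numpy as np


def pca_reduce(vectors, variance_kept=0.90):
    """Project m feature vectors (rows) onto the first d' principal
    components that together explain at least `variance_kept` of the variance."""
    X = np.asarray(vectors, dtype=float)  # shape (m, d)
    centered = X - X.mean(axis=0)         # step 1: subtract per-dimension mean
    # Step 2: rows of vt are the principal components; the squared singular
    # values are proportional to the variance along each component.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2
    total = variance.sum()
    if total == 0.0:
        return np.zeros((X.shape[0], 0))  # all vectors identical
    ratio = np.cumsum(variance) / total
    d_prime = min(int(np.searchsorted(ratio, variance_kept)) + 1, len(s))
    # Step 3: project onto the selected subspace.
    return centered @ vt[:d_prime].T


# Hypothetical vectors: the first two files are always updated together.
data = [[1, 1, 0], [1, 1, 0], [0, 0, 0], [0, 0, 0]]
reduced = pca_reduce(data)
```

Because the first two dimensions are perfectly correlated and the third never changes, a single component captures all of the variance in this toy input.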



3.3 Anomaly Detection by Clustering 

Once we obtain a list of transformed feature vectors using feature selection, we cluster 
the vectors based on the distance between every pair of them. 
We call a cluster a new cluster if it consists of multiple vectors only from the 
detection window. The appearance of a new cluster indicates that possibly abnormal file 
updates occurred during the detection window, and it should raise an alarm. 

There are many existing algorithms for clustering, for example, K-means [13, 14] or 
Single Linkage Hierarchical Clustering [10]. Seurat uses a simple iterative algorithm, 
which is a common method for K-means initialization, to cluster vectors without prior 
knowledge of the number of clusters [15]. The algorithm assumes each cluster has a 
hub. A vector belongs to the cluster whose hub is closest to that vector compared with 
the distances from other hubs to that vector. The algorithm starts with one cluster whose 
hub is randomly chosen. Then, it iteratively selects a vector that has the largest distance 
to its own hub as a new hub, and re-clusters all the vectors based on their distances to 
all the selected hubs. This process continues until there is no vector whose distance to 
its hub is larger than half of the average hub-hub distance. 
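
A rough sketch of this iterative hub selection is given below. It is not the Seurat code: the function names are hypothetical, and two details the text leaves open are filled with assumptions, namely a small tolerance `eps` that serves as the stopping threshold while only one hub exists, and a fixed random seed for reproducibility.

```python
import math
import random
from itertools import combinations
from typing import List, Sequence


def cosine_angle(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Angle between two vectors (pi/2 if either one is all zeros)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0.0 or n2 == 0.0:
        return math.pi / 2
    return math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))


def hub_cluster(vectors, distance=cosine_angle, seed=0, eps=1e-6) -> List[int]:
    """Return, for each vector, the index of the hub it is assigned to."""
    rng = random.Random(seed)
    hubs = [rng.randrange(len(vectors))]  # start with one random hub
    while True:
        # Assign every vector to its closest hub.
        labels = [min(hubs, key=lambda h: distance(v, vectors[h]))
                  for v in vectors]
        dists = [distance(v, vectors[h]) for v, h in zip(vectors, labels)]
        farthest = max(range(len(vectors)), key=dists.__getitem__)
        if len(hubs) > 1:
            pairs = list(combinations(hubs, 2))
            mean_hub = sum(distance(vectors[a], vectors[b])
                           for a, b in pairs) / len(pairs)
            threshold = max(mean_hub / 2.0, eps)
        else:
            threshold = eps  # assumption: no hub-hub distance exists yet
        if dists[farthest] <= threshold or farthest in hubs:
            return labels
        hubs.append(farthest)  # the farthest vector becomes a new hub


points = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
labels = hub_cluster(points)
```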

We choose this simple iterative algorithm because it runs much faster than the Single 
Linkage Hierarchical algorithm and works equally well in our experiments. The reason 
that even the simple clustering algorithm works well is that the ratio of inter-cluster 
distance to intra-cluster distance significantly increases after feature selection. Since the 
current clustering algorithm is sensitive to outliers, we plan to explore other clustering 
algorithms such as K-means. 

Once we detect a new cluster and generate an alarm, we examine further to identify 
the involved hosts and the files from which the cluster resulted. The suspicious hosts 
are just the ones whose file updates correspond to the feature vectors in the new cluster. 
To determine which files possibly cause the alarm, we only focus on the files picked by 
the wavelet-based selection to define the feature vector space. For each of those files, if 
it is updated by all the hosts in the new cluster during the detection window, but has not 
been updated by any host during the corresponding comparison window, Seurat outputs 
this file as a candidate file. Similarly, Seurat also reports the set of files that have been 
updated during the comparison window, but are not updated by any host in the new 
cluster during the detection window. 
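
The new-cluster test and the candidate-file report can be sketched as follows. All names and data layouts are hypothetical; `labels` is assumed to come from some clustering step that assigns one cluster label per vector.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple


def new_clusters(labels: Sequence[Hashable],
                 in_detection: Sequence[bool]) -> List[List[int]]:
    """Clusters with at least two vectors, all from the detection window."""
    members: Dict[Hashable, List[int]] = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)
    return [m for m in members.values()
            if len(m) >= 2 and all(in_detection[i] for i in m)]


def candidate_files(selected: Sequence[str], cluster_hosts: Sequence[str],
                    detection: Dict[str, Set[str]],
                    comparison: Set[str]) -> Tuple[List[str], List[str]]:
    """detection: host -> files it updated during the detection window.
    comparison: files updated by any host during the comparison window."""
    # Updated by every host in the new cluster, but never seen before.
    appeared = [f for f in selected
                if all(f in detection[h] for h in cluster_hosts)
                and f not in comparison]
    # Updated before, but by no host in the new cluster now.
    vanished = [f for f in selected
                if f in comparison
                and not any(f in detection[h] for h in cluster_hosts)]
    return appeared, vanished


clusters = new_clusters([0, 0, 0, 3, 3], [False, True, False, True, True])
appeared, vanished = candidate_files(
    ["/bin/ls", "/var/log/messages", "/etc/motd"],
    ["A", "B"],
    {"A": {"/bin/ls", "/etc/motd"}, "B": {"/bin/ls"}},
    {"/var/log/messages", "/etc/motd"},
)
```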

Based on the suspicious hosts and the selected files for explaining root causes, sys- 
tem administrators can decide whether the updates are known administrative updates 
that should be suppressed, or some abnormal events that should be further investigated. 
If the updates are caused by malicious attacks, administrators can take remedial countermeasures 
for the new cluster. Furthermore, additional compromised hosts can be identified 
by checking if the new cluster expands later and if other hosts have updated the 
same set of candidate files. 

4 Experiments 

We have developed a multi-platform (Linux and Windows) prototype of Seurat that con- 
sists of a lightweight data collection tool and a correlation module. The data collection 
tool scans the file system of the host where it is running and generates a daily summary 
of file update attributes. Seurat harvests the summary reports from multiple hosts in a 
network system and the correlation module uses the reports for anomaly detection. 

We have installed the Seurat data collection tool on a number of campus office 
machines and a teaching cluster that are used by students daily. By default, the tool 
scans the attributes of all system files on a host. For privacy reasons, personal files 
under user home directories are not scanned. The attributes of a file include the file 
name, type, device number, permissions, size, inode number, important timestamps, and 
a 16-byte MD5 checksum of the file content. The current system uses only a binary bit to 
represent each file update, but the next version may exploit other attributes reported by 
the data collection tool. Each day, each host compares the newly scanned disk snapshot 
against that from the previous day and generates a file update summary report. In the 
current prototype, all the reports are uploaded daily to a centralized server where system 
administrators can monitor and correlate the file updates using the correlation module. 
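
The daily snapshot comparison can be pictured as a dictionary diff. The sketch below is an assumption-laden illustration, not the prototype's code: the name `update_summary`, the snapshot layout (path mapped to a tuple of scanned attributes), and the choice to count created and deleted files as updates are all ours.

```python
def update_summary(previous, current):
    """Return the sorted list of paths whose state changed between two snapshots.

    Each snapshot maps a file path to a tuple of scanned attributes
    (hypothetical layout, e.g. (md5_hex, size)).
    """
    changed = set()
    for path in previous.keys() | current.keys():
        # A path missing on one side (created or deleted) or with differing
        # attributes is reported as one binary "updated" event.
        if previous.get(path) != current.get(path):
            changed.add(path)
    return sorted(changed)
```

Each path in the returned list corresponds to one bit set in that host's feature vector for the day.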

In this section, we study the effectiveness of Seurat’s pointillist approach for 
detecting aggregated anomalous events. We use the daily file update reports from our real 
deployment to study the false positive rate and the corresponding causes in Section 4.1. 
We evaluate the false negative rate with simulated attacks in Section 4.2. In order to 
verify the effectiveness of our approach on real malicious attacks, we launched a real 
Linux worm into an isolated cluster and report the results in Section 4.3. 

4.1 False Positives 

The best way to study the effectiveness of our approach is to test it with real data. 
We have deployed Seurat on a teaching cluster of 22 hosts and have been collecting the 
daily file update reports since Nov 2003. The teaching cluster is mostly used by students 
for their programming assignments. It is also occasionally used by a few graduate 
students for running network experiments. 

For this experiment, we use the file update reports from Dec 1, 2003 until Feb 
29, 2004 to evaluate the false positive rate. During this period, there were a few days 
when a couple of hosts failed to generate or upload reports due to system failures or 
reconfigurations. We simply ignore this small number of missing reports 
because they do not affect the aggregated file update patterns. 

We set the correlation window to 32 days in order to accommodate monthly file 
update patterns. That is, we correlate the update pattern from day 1 to day 32 to identify 
abnormal events on day 32, and correlate the update pattern from day 2 to day 33 to 
detect anomalies on day 33, etc. Thus, our detection starts from Jan 1, 2004, since we 
do not have 32-day correlation windows for the days in Dec 2003. 
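
The sliding-window schedule can be checked with a few lines of date arithmetic. The helper name `detection_days` is hypothetical; the dates and the 32-day window come from the text above.

```python
from datetime import date, timedelta

def detection_days(first_report, last_report, window=32):
    """Yield (window_start, detection_day) for every day that has a full window."""
    day = first_report + timedelta(days=window - 1)
    while day <= last_report:
        yield day - timedelta(days=window - 1), day
        day += timedelta(days=1)

windows = list(detection_days(date(2003, 12, 1), date(2004, 2, 29)))
print(windows[0])    # first full window: Dec 1, 2003 through Jan 1, 2004
print(len(windows))  # number of days under detection
```

With reports from Dec 1, 2003 to Feb 29, 2004, the first detection day is Jan 1, 2004 and there are 60 days under detection, which agrees with the figures used in this section.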



Dimension Reduction. Once we have fixed the correlation window of a particular day, we 
identify relevant files using wavelet-based selection to define the feature vector space; 
for simplicity, we use a constant threshold α = 2. We then perform PCA to reduce the data 
dimensionality further by picking the first several principal components that account 
for 98% of the input data variance. 
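
As an illustration of the 98% rule, the sketch below picks the number of leading principal components from a list of covariance eigenvalues. It assumes the eigenvalues have already been computed by some PCA routine; the function name `components_for_variance` is ours.

```python
def components_for_variance(eigenvalues, target=0.98):
    """Smallest number of leading principal components whose cumulative
    share of the total variance reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)   # largest variance first
    total = sum(ordered)
    if total <= 0:
        return 0
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)
```

For eigenvalues 60, 30, 9, and 1, the first three components carry 99% of the variance, so three dimensions are kept.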

Throughout the entire period of 91 days, 772 files with unique file names were 
updated by at least two different hosts. Figure 7 (a) shows the number of hosts that 
updated each file during the data collection period. We observe that only a small number 
of files (e.g., /var/adm/syslog/mail.log) are updated regularly by all of the 
hosts, while most other files (e.g., /var/run/named.pid) are updated irregularly, 
depending on the system usage or the applications running. 

Fig. 7. Feature selection and dimension reduction: (a) File update patterns. Files are sorted by the 
cumulative number of hosts that have updated them throughout the 91 days. The darker the color 
is, the more hosts updated the corresponding file. (b) The number of feature vector dimensions 
after wavelet-based selection and PCA consecutively. 



Figure 7 (b) shows the results of feature selection. There were, on average, 140 files 
updated by at least two different hosts during each correlation window. After wavelet- 
based selection, the average number of feature dimensions is 17. PCA further reduces 
the vector space dimension to below 10. 



False Alarms. After dimension reduction, we perform clustering of feature vectors and 
identify new clusters for each day. Figure 8 illustrates the clustering results of 6 consec- 
utive days from Jan 19, 2004 to Jan 24, 2004. There are two new clusters identified on 
Jan 21 and Jan 23, which involve 9 hosts and 6 hosts, respectively. Since Seurat outputs 
a list of suspicious files as the cause of each alarm, system administrators can tell if the 
new clusters are caused by malicious intrusions. 

Based on the list of files output by Seurat, we can figure out that the new clusters on 
Jan 21 and Jan 23 reflect large scale file updates due to a system reconfiguration at the 
beginning of the spring semester. For both days, Seurat accurately pinpoints the exact 
hosts that are involved. The reconfiguration started from Jan 21, when a large number 
of binaries, header files, and library files were modified on 9 out of the 22 hosts. Since 
the events are known to system administrators, we treat the identified vectors as normal 
for future anomaly detection. Thus, no alarm is triggered on Jan 22, when the same 
set of library files were modified on 12 other hosts. On Jan 23, the reconfiguration 
continued to remove a set of printer files on 6 out of the 22 hosts. Again, administrators 
can mark this event as normal and we spot no new cluster on Jan 24, when 14 other 
hosts underwent the same set of file updates. 

In total, Seurat raises alarms on 9 out of the 60 days under detection, among which 
6 were due to system reconfigurations. Since the system administrators are aware of 
such events in advance, they can simply suppress these alarms. The 3 other alarms 
are generated on 3 consecutive days when a graduate student performed a network 
experiment that involved simultaneous file updates at multiple hosts. Such events are 
normal but rare, and should alert the system administrators. 

Fig. 8. Clustering feature vectors for anomaly detection: Each circle represents a cluster. The 
number at the center of the figure shows the total number of clusters. The radius of a circle 
corresponds to the number of points in the cluster, which is also indicated beside the circle. The 
squared dots correspond to the new points generated on the day under detection. New clusters are 
identified by a thicker circle. 



4.2 False Negatives 

The primary goal of this experiment is to study the false negative rate and detection 
latency of Seurat as the stealthiness of the attack changes. We use simulated attacks by 
manually updating files on the selected host reports, as if they were infected. 

We first examine the detection rate of Seurat by varying the degree of attack ag- 
gressiveness. We model the attack propagation speed as the number of hosts infected 
on each day (the detection window), and model the attack stealthiness on a local host 
as the number of new files installed by this attack. Our simulation runs on the same 
teaching cluster that we described in Section 4.1. Since the aggregated file update pat- 
terns are different for each day, we randomly pick ten days in Feb 2004, when there was 
no intrusion. On each selected day, we simulate attacks by manually inserting artificial 
new files into a number of host reports on only that day, and use the modified reports 
as input to the detection algorithm. We then remove those modified entries, and repeat the 
experiment with another day. The detection rate is calculated as the number of days 
on which Seurat spots new clusters divided by the total of ten days. 
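
The simulation procedure can be sketched as follows. Everything here is illustrative: the names `inject_attack` and `detection_rate`, the report layout (host mapped to the set of updated files), and the `detects_new_cluster` callback standing in for the correlation module are assumptions, not part of the prototype.

```python
import copy

def inject_attack(reports, hosts, n_files, prefix="/attack/file"):
    """Return a copy of one day's reports with n_files artificial new files
    added to each chosen host. `reports` maps host -> set of updated files."""
    modified = copy.deepcopy(reports)   # leave the real reports untouched
    for host in hosts:
        modified.setdefault(host, set()).update(
            f"{prefix}{i}" for i in range(n_files)
        )
    return modified

def detection_rate(daily_reports, hosts, n_files, detects_new_cluster):
    """Fraction of selected days on which the injected attack is flagged.

    detects_new_cluster is a stand-in for the correlation module: it takes
    one day's modified reports and returns True if a new cluster is found.
    """
    hits = sum(
        1 for reports in daily_reports
        if detects_new_cluster(inject_attack(reports, hosts, n_files))
    )
    return hits / len(daily_reports)
```

Because each day's reports are copied before modification, the same ten days can be reused for every combination of infected hosts and inserted files.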

Figure 9 shows the detection rate of Seurat by varying the number of files inserted 
on each host and the number of hosts infected. On one hand, the detection rate mono- 
tonically increases as we increase the number of files inserted on each host by an attack. 
Since the inserted files do not exist before, each of them will be selected as a feature di- 
mension by the wavelet-based selection, leading to larger distances between the points 
of infected host state changes and the points of normal host state changes. Therefore, 
the more new files are injected by an attack, the higher the detection rate gets. On the 
other hand, as we increase the number of infected hosts, the number of points for 
abnormal host state changes becomes large enough to create an independent new cluster. 
Thus, rapidly propagating attacks are more likely to be caught. Accordingly, detecting 
a slowly propagating attack requires a larger detection window, hence longer detection 
latency, in order to accumulate enough infected hosts. We revisit this issue in Section 5. 

Fig. 9. Detection rate: We vary the number of hosts infected and the number of files inserted on 
each host by the simulated attacks. 

We further evaluate the detection rate of Seurat on six Linux worms with simulated 
attacks. To do so, we compile a subset of files modified by each worm based on the 
descriptions from public Web sites such as Symantec [16] and F-Secure information 
center [17]. We then manually modify the described files in a number of selected host 
reports to simulate the corresponding worm attacks. Again, for each worm, we vary 
the number of infected hosts, and run our experiments on the teaching cluster with ten 
randomly selected days. 

Fig. 10. Detection rate of emulated worms: We vary the number of hosts compromised by the 
attacks. 



Figure 10 shows the number of files modified by each worm and the detection rate 
of Seurat. In general, the more files modified by a worm, the more likely the worm will 
be detected. But the position of a file in the file system directory tree also matters. For 
example, both the Slapper-B worm and the Kork worm insert 4 new files into a compromised 
host. However, the Kork worm additionally modifies /etc/passwd to create accounts 
with root privileges. Because there are many hosts that have updated /etc/passwd 
during a series of system reconfiguration events, the inclusion of such files in the feature 
vector space reduces the distances from abnormal points to normal points, resulting in 
higher false negative rates. We discuss this further in Section 5. 

4.3 Real Attacks 

Now we proceed to examine the efficacy of Seurat during a real worm outbreak. The 
best way to show this would be to have Seurat detect an anomaly caused by a new worm 
propagation. Instead of waiting for a new worm’s outbreak, we have set up an isolated 
computer cluster where, without damaging the real network, we can launch worms and 
record file system changes. This way, we have full control over the number of hosts 
infected, and can repeat the experiments. Because the isolated cluster has no real users, 
we merge the data acquired from the isolated cluster with the data we have collected 
from the teaching cluster in order to conduct experiments. 

We obtained the binaries and source code of a few popular worms from public 
Web sites such as whitehats [18] and packetstorm [19]. Extensively testing Seurat with 
various real worms in the isolated cluster requires tremendous effort in setting up each 
host with the right versions of vulnerable software. As a first step, we show the result 
with the Lion worm [20] in this experiment. 

The Lion worm was found in early 2001. Lion exploits a vulnerability of BIND 
8.2, 8.2-P1, 8.2.1, 8.2.2-Px. Once Lion infects a system, it sets up backdoors, leaks out 
confidential information (/etc/passwd, /etc/shadow) via email, and scans the 
Internet to recruit vulnerable systems. Lion scans the network by randomly picking the 
first 16 bits of an IP address, and then sequentially probing all the 2^16 IP addresses in 
the space of the block. After that, Lion randomly selects another such address block to 
continue scanning. As a result, once a host is infected by Lion, all the vulnerable hosts 
nearby (in the same IP address block) will be infected soon. Lion affects file systems: 
the worm puts related binaries and shell scripts under the /dev/.lib directory, copies 
itself into the /tmp directory, changes system files under the /etc directory, and tries 
to wipe out some log files. 
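
To see why hosts in the same address block are reached quickly, the address ordering described above can be enumerated. The generator below only produces address strings and performs no network activity; its name `lion_style_scan` and the use of Python's `random` module are assumptions made for illustration.

```python
import random

def lion_style_scan(rng=random):
    """Yield dotted-quad addresses: pick a random 16-bit prefix, enumerate the
    whole block in order, then pick another prefix and repeat."""
    while True:
        prefix = rng.getrandbits(16)          # random first 16 bits
        a, b = prefix >> 8, prefix & 0xFF
        for low in range(2 ** 16):            # all 65,536 addresses of the block
            yield f"{a}.{b}.{low >> 8}.{low & 0xFF}"
```

Consecutive outputs share the same first two octets for 65,536 steps, so once one host in a block is compromised, its neighbors are probed before any address outside the block.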

We configured the isolated cluster with three Lion-vulnerable hosts and one addi- 
tional machine that launched the worm. The vulnerable machines were running RedHat 
6.2 including the vulnerable BIND 8.2.2-P5. The cluster used one class C network 
address block. Every machine in the cluster was connected to a 100 Mbps Ethernet and 
was running named. The Seurat data collection tool generated a file system update 
report on every machine daily. 

After we launched the Lion worm, all three vulnerable hosts in the isolated cluster 
were infected quickly one after another. We merge the file update report from each 
compromised host with a different normal host report generated on Feb 11, 2004, when 
we know there was no anomaly. Figure 11 shows the clustering results of three 
consecutive days from Feb 10, 2004 to Feb 12, 2004 using the merged reports. 

On the attack day, there are 64 files picked by the wavelet-based selection. The 
number of feature dimensions is reduced to 9 after PCA. Seurat successfully detects a 
new cluster consisting of the 3 infected hosts. Figure 12 lists the 22 files selected by 
Seurat as the causes of the alarm. These files provide enough hints to the administrators 
to confirm the existence of the Lion worm. Once detected, these compromised hosts as 
well as the list of suspicious files can be marked for future detection. If, in the following 
days, there are more hosts that are clustered together with the already infected machines, 
or experience the same file updates, then we may conclude they are infected by the same 
attack. 

Fig. 11. Intrusion detection by Seurat: Seurat identified a new cluster of three hosts on Feb 11, 
2004, when we manually launched the Lion worm. 

File ID  File name                   File ID  File name 
1        /sbin/asp                   12       /var/spool/mail 
2        /dev/.lib                   13       /dev/.lib/bindx.sh 
3        /dev/.lib/star.sh           14       /tmp/ramen.tgz 
4        /var/spool/mail/root        15       /dev/.lib/scan.sh 
5        /dev/.lib/bind              16       /dev/.lib/pscan 
6        /etc/hosts.deny             17       /var/spool/mqueue 
7        /dev/.lib/randb             18       /dev/.lib/hack.sh 
8        /sbin                       19       /dev/.lib/.hack 
9        /var/log                    20       /dev/.lib/index.html 
10       /dev/.lib/bindname.log      21       /dev/.lib/asp62 
11       /dev/.lib/index.htm         22       /var/log/sendmail.st 

Fig. 12. Suspicious files for the new cluster on Feb 11, 2004. 



5 Discussion 

5.1 Vulnerabilities and Limitations 

By identifying parallel occurrences of coincident events, Seurat will be most successful 
in detecting virus or worm propagations that result in file modifications at multiple 
hosts. Certain attacks (e.g., password guessing attacks) that succeed only once or a few 
times in a network system may evade Seurat detection. The current prototype of Seurat 
also has limited capability to detect the following types of attacks. 

Stealthy attack. Attackers may try to evade detection by slowing attack propagation. If 
an attacker is patient enough to infect only one host a day in the monitored network 
system, Seurat will not notice the intrusion with the current one-day detection window 
because Seurat focuses only on anomalous file changes common across multiple hosts. 
A larger detection window such as a couple of days or a week can help to catch slow, 
stealthy attacks. Note, however, that Seurat notices the attacks only after multiple hosts 
in the network system are compromised. In other words, if an attack propagates slowly, 
Seurat may not recognize the attack for the first few days after the initial successful 
compromise. There is thus a tradeoff between detection rate and detection latency. 

Mimicry attack. An attacker can carefully design an attack to cause file updates that 
look similar to regular file changes, and mount a successful mimicry attack [21]. There 
are two ways to achieve a mimicry attack against the current prototype. First, an 
attacker may try to fool Seurat’s feature selection process by camouflaging all intrusion 
files as frequently, regularly updated files. Those concealed files, even when they are 
modified in an unexpected way (e.g., entries removed from append-only log files), will 
not be selected as feature vector dimensions because of the current use of the binary feature 
representation. Note that Seurat’s data collection tool provides additional information 
on file system changes, such as file size, file content digest, and permissions. By 
incorporating the extra information in representing host state transition, Seurat can make 
such mimicry attacks harder. Second, an attacker may find a way to cloak abnormal file 
updates with many normal but irregular changes during Seurat’s clustering process. For 
example, in Section 4.2, we observed that the false negative rate of detecting the Kork 
worm was relatively high due to the interference of irregular system reconfiguration. 
We leave it as future work to quantify this type of mimicry attack and the effectiveness 
of possible countermeasures. 

Random-file-access attack. Seurat correlates file updates based on their complete path 
names. Thus attackers can try to evade Seurat by installing attack files under different 
directories at different hosts, or replacing randomly chosen existing files with attack 
files. Many recent email viruses already change the virus file names when they 
propagate to a new host; we envision similar techniques could be employed by other types of 
attacks soon. Note, however, that even the random-file-access attack may need a few 
anchor files at fixed places, where Seurat still has the opportunity to detect such attacks. A 
more robust representation of a file, for example, an MD5 checksum, could help Seurat 
detect random-file-access attacks. 
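
One way to picture the checksum-based representation is to key updates by content digest instead of by path. The sketch below is a hedged illustration of that idea only; the names `digest_key` and `shared_digests` and the input layout are hypothetical, and the current prototype does not work this way.

```python
import hashlib

def digest_key(content: bytes) -> str:
    """MD5 hex digest of file content, used here as a path-independent key."""
    return hashlib.md5(content).hexdigest()

def shared_digests(host_files):
    """Map each content digest to the hosts where it appeared under any path.

    host_files: {host: {path: content_bytes}} (hypothetical input layout).
    Only digests seen on two or more hosts are returned.
    """
    seen = {}
    for host, files in host_files.items():
        for content in files.values():
            seen.setdefault(digest_key(content), set()).add(host)
    return {d: hosts for d, hosts in seen.items() if len(hosts) >= 2}
```

Identical attack files installed under different names on different hosts would then fall on the same feature dimension. An attacker who also varies file content per host would still evade this representation.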

Memory-resident attack. Memory-resident and BIOS-resident only attacks make no file 
system updates. Thus Seurat will not be able to detect memory-resident attacks by 
examining host file updates, nor those attacks that erase disk evidence before the Seurat 
data collection tool performs the scheduled disk scan. 

Kernel/Seurat modification attack. The effectiveness of Seurat relies on the correctness 
of reports from the data collecting tools running on distributed hosts. So the host ker- 
nels and the Seurat data collection tools should run on machines protected by trusted 
computing platforms [22]. An alternative solution is to monitor file system changes in 
real time (discussed further in Section 5.2) and to protect file update logs using 
secure audit logging [23]. 



5.2 Future Work 

Real-time anomaly detection. The current prototype periodically scans and reports file 
system updates with a 1-day cycle, which may be slow to detect fast propagating 
attacks. To shorten detection latency, we are enhancing the Seurat data collection module 
to monitor system calls related to file updates, and report the changes immediately to 
the correlation module. The reported file updates will be instantly reflected by setting 
the corresponding bits in the feature vectors at the Seurat correlation module, which 
continuously performs clustering of the new feature vectors for real-time anomaly 
detection. 
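
The bit-setting step of such a real-time design might look like the sketch below. The class `LiveFeatureVectors` and its methods are hypothetical; how update events would be captured from system calls is outside the scope of the example.

```python
from collections import defaultdict

class LiveFeatureVectors:
    """Keeps one binary update vector per host, stored as a set of dimensions."""

    def __init__(self):
        self.file_index = {}               # path -> dimension number
        self.vectors = defaultdict(set)    # host -> dimensions currently set to 1

    def record_update(self, host, path):
        """Set the bit for `path` in the vector of `host` when an update is reported."""
        dim = self.file_index.setdefault(path, len(self.file_index))
        self.vectors[host].add(dim)

    def as_bits(self, host):
        """Dense 0/1 list for one host over all dimensions seen so far."""
        return [1 if d in self.vectors[host] else 0
                for d in range(len(self.file_index))]
```

A clustering routine could then be rerun on the dense vectors whenever enough new bits have been set.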

Distributed correlation module. Currently, Seurat moves the daily reports from dis- 
tributed data collection tools to a centralized server, where the correlation module com- 
putes and clusters the host vectors. Despite the simplicity of centralized deployment, the 
centralized approach exposes Seurat to problems in scalability and reliability. First, the 
amount of report data to be transferred to and stored at the centralized server is large. In 
our experience, a host generates a file update report of 3 KB to 140 KB daily in a 
compressed format, so the aggregate report size from hundreds or thousands of hosts with a 
long comparison window will be large. The report size will be larger when Seurat’s data 
collection tool reports the host state changes in real time. Second, the monitored hosts 
could be in different administrative domains (e.g., hosts managed by different academic 
departments or labs) and it is often impractical to transfer the detailed reports from all 
the hosts to one centralized server due to privacy and confidentiality issues. Third, the 
centralized server can be a single point of failure. It is important for Seurat to work 
even when one correlation server is broken or part of the network is partitioned. A 
distributed correlation module will cope with those issues. We are now investigating 
methods to correlate file update events in a distributed architecture such as EMERALD [1], 
AAFID [24], and Mingle [25]. 
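
A back-of-the-envelope estimate makes the storage concern concrete. The host count of 1,000 is an assumed example, 1 KB is taken as 1,000 bytes, and the 32-day window matches the correlation window of Section 4.1; only the per-report sizes come from the text.

```python
def aggregate_report_bytes(hosts, window_days, bytes_per_report):
    """Total size of the reports a central server must keep for one window."""
    return hosts * window_days * bytes_per_report

low = aggregate_report_bytes(1000, 32, 3_000)       # 3 KB per report
high = aggregate_report_bytes(1000, 32, 140_000)    # 140 KB per report
print(low / 1e6, "MB to", high / 1e9, "GB")
```

Under these assumptions the server holds between 96 MB and about 4.5 GB of compressed reports per window.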

Other applications. The approach of clustering coincident host state changes can be 
generalized to other types of applications such as detecting the propagation of spyware, 
illegal file sharing events, or erroneous software configuration changes. We are currently 
deploying Seurat on PlanetLab [26] hosts for detecting software configuration errors by 
identifying host state vectors that do not fall into an expected cluster. 

6 Related Work 

Seurat uses file system updates to represent a host state change. File system updates 
have been known to be useful information for intrusion detection. Tripwire [4], AIDE 
[8], and Samhain [27] are well-known intrusion detection systems that use file system 
updates to find intrusions. Recently proposed systems such as the storage-based intrusion 
detection systems [7] and some commercial tools [28] support real-time integrity check- 
ing. All of them rely on a pre-defined rule set to detect anomalous integrity violation, 
while Seurat diagnoses the anomaly using learning and correlation across time and 
space. 

Leveraging the information gathered from distributed multiple measurement points 
is not a new approach. Many researchers have noticed the potential of the collective 
approaches for intrusion detection or anomaly detection. Graph-based Intrusion De- 
tection System (GrIDS) [29] detects intrusions by building a graph representation of 
network activity based on the report from all the hosts in a network. Different from 
Seurat, GrIDS uses the TCP/IP network activity between hosts in the network to infer 
patterns of intrusive or hostile activities based on pre-defined rules. Other systems, such 
as Cooperative Security Managers (CSM) [30] and Distributed Intrusion Detection System 
(DIDS) [31], also take advantage of a collective approach to intrusion detection. They 
orchestrate multiple monitors watching multiple network links and track user activity 
across multiple machines. 

EMERALD (Event Monitoring Enabling Responses to Anomalous Live 
Disturbances) [1] and Autonomous Agents For Intrusion Detection (AAFID) [24] have 
independently proposed distributed architectures for intrusion detection and response 
capability. Both of them use local monitors or agents to collect interesting events and 
anomaly reports (from a variety of sources: audit data, network packet traces, SNMP 
traffic, application logs, etc.). The architectures provide the communication methods to 
exchange the locally detected information and an easy way to manage components of 
the systems. AAFID performs statistical profile-based anomaly detection and 
EMERALD supports a signature-based misuse analysis in addition to the profile-based 
anomaly detection. Note that Seurat starts with similar motivation. But Seurat focuses 
more on the technique for correlating the collective reports for anomaly detection, and 
infers interesting information on the system state from learning, rather than relying on a 
pre-defined set of events or rules. We envision Seurat as a complementary technique, not 
as a replacement of the existing architectures that provide global observation sharing. 

Correlating different types of audit logs and measurement reports is another active 
area in security research. Many researchers have proposed to correlate multiple het- 
erogeneous sensors to improve the accuracy of alarms [3,32,33,2,34]. In this work, 
we attempt to correlate information gathered by homogeneous monitors (especially, the 
file system change monitors) but we may enhance our work to include different types of 
measurement data to represent individual host status. 

Wang et al. [35] also have noticed the value of spatial correlation of multiple 
system configurations and applied a collective approach to tackle misconfiguration 
troubleshooting problems. In their system, a malfunctioning machine can diagnose its problem 
by collecting system configuration information from other similar and friendly hosts 
connected via a peer-to-peer network. The work does not target automatic detection of 
the anomaly, but rather it aims at figuring out the cause of a detected problem. 



7 Conclusions 

In this paper, we presented a new “pointillist” approach for detecting aggregated anoma- 
lous events by correlating information about host file updates across both space and 
time. Our approach explores the temporal and spatial locality of system state changes 
through learning and correlation. It requires neither prior knowledge about normal host 
activities, nor system specific rules. 

A prototype implementation, called Seurat, suggests that the approach is effective in 
detecting rapidly propagating attacks that modify host file systems. The detection rate 
degrades as the stealthiness of attacks increases. By trading off detection latency, we 
are also able to identify hosts that are compromised by slowly propagating attacks. For 
each alarm, Seurat identifies suspicious files and hosts for further investigation, greatly 
facilitating root cause diagnosis and false alarm suppression. 

Abstract. Intruders on the Internet often prefer to launch network 
intrusions indirectly, i.e., using a chain of hosts on the Internet as relay 
machines using protocols such as Telnet or SSH. This type of attack is 
called a stepping-stone attack. In this paper, we propose and analyze al- 
gorithms for stepping-stone detection using ideas from Computational 
Learning Theory and the analysis of random walks. Our results are the 
first to achieve provable (polynomial) upper bounds on the number of 
packets needed to confidently detect and identify encrypted stepping- 
stone streams with proven guarantees on the probability of falsely accus- 
ing non-attacking pairs. Moreover, our methods and analysis rely on mild 
assumptions, especially in comparison to previous work. We also examine 
the consequences when the attacker inserts chaff into the stepping-stone 
traffic, and give bounds on the amount of chaff that an attacker would 
have to send to evade detection. Our results are based on a new approach 
which can detect correlation of streams at a fine-grained level. Our ap- 
proach may also apply to more generalized traffic analysis domains, such 
as anonymous communication. 

Keywords: Network intrusion detection, Evasion, Stepping stones, 
Interactive sessions, Random walks. 



1 Introduction 

Intruders on the Internet often launch network intrusions indirectly, in order to 
decrease their chances of being discovered. One of the most common methods 
used to evade surveillance is the construction of stepping stones. In a stepping- 
stone attack, an attacker uses a sequence of hosts on the Internet as relay ma- 
chines and constructs a chain of interactive connections using protocols such as 
Telnet or SSH. The attacker types commands on his local machine and then the 
commands are relayed via the chain of “stepping stones” until they finally reach 
the victim. Because the final victim only sees traffic from the last hop of the 
chain of the stepping stones, it is difficult for the victim to learn any informa- 
tion about the true origin of the attack. The chaotic nature and sheer volume 
of the traffic on the Internet makes such attacks extremely difficult to record or 
trace back. 

To combat stepping-stone attacks, the approach taken by previous research 
(e.g., [1-4]), and the one that we adopt, is to instead ask the question “What 
can we detect if we monitor traffic at the routers or gateways?” That is, we 
examine the traffic that goes in and out of routers, and try to detect which 
streams, if any, are part of a stepping-stone attack. This problem is referred 
to as the stepping-stone detection problem. A stepping-stone monitor analyzes 
correlations between flows of incoming and outgoing traffic which may suggest 
the existence of a stepping stone. Like previous approaches, in this paper we 
consider the detection of interactive attacks: those in which the attacker sends 
commands through the chain of hosts to the target, waits for responses, sends 
new commands, and so on in an interactive session. Such traffic is characterized 
by streams of packets, in which packets sent on the first link appear on the next 
a short time later, within some maximum tolerable delay bound Δ. Like previous 
approaches, we assume traffic is encrypted, and thus the detection mechanisms 
cannot rely on analyzing the content of the streams. We will call a pair of streams 
an attacking pair if it is a stepping-stone pair, and we will call a pair of streams 
a non-attacking pair if it is not a stepping-stone pair. 

Researchers have proposed many approaches for detecting stepping stones 
in encrypted traffic (e.g., [1-3]; see Section 2 for more detailed related work). 
However, most previous approaches in this area are based on ad-hoc heuristics 
and do not give any rigorous analysis that would provide provable guarantees of 
the false positive rate or the false negative rate [2, 3]. Donoho et al. [4] proposed a 
method based on wavelet transforms to detect correlations of streams, and it was 
the first work that performed rigorous analysis of their method. However, they do 
not give a bound on the number of packets that need to be observed in order to 
detect attacks with a given level of confidence. Moreover, their analysis requires 
the assumption that the packets on the attacker’s stream arrive according to 
a Poisson or a Pareto distribution - in reality, the attacker’s stream may be 
arbitrary. Wang and Reeves [5] proposed a watermark-based scheme which can 
detect correlation between streams of encrypted packets. However, they assume 
that the attacker’s timing perturbation of packets is independent and identically 
distributed (iid), and their method breaks when the attacker perturbs traffic in 
other ways. 

Thus, despite the volume of previous work, an important question still re- 
mains open: how can we design an efficient algorithm to detect stepping-stone 
attacks with (a) provable bounds on the number of packets that need to be mon- 
itored, (b) a provable guarantee on the false positive and false negative rate, and 
(c) few assumptions on the distributions of attacker and normal traffic? 

This paper sets out to answer this question. In particular, we use
ideas from Computational Learning Theory to produce a strong set of guarantees 
for this problem: 

Objectives: We explicitly set our objective to be to distinguish attacking pairs 
from non-attacking pairs, given our fairly mild assumptions about each. In 
contrast, the work of Donoho et al. [4] detects only if a pair of streams 
is correlated. This is equivalent to our goal if one assumes non-attacking 
pairs are perfectly uncorrelated, but that is not necessarily realistic and 
our assumptions about non-attacking pairs will allow for substantial coarse-
grained correlation among them. For example, if co-workers work and take
breaks together, their typing behavior may be correlated at a coarse-grained 
level even though they are not part of any attack. Our models allow for this 
type of behavior on the part of “normal” streams, and yet we will still be 
able to distinguish them from true stepping-stone attacks. 

Fewer assumptions: We make very mild assumptions, especially in compar-
ison with previous work. For example, unlike the work by Donoho et al.,
our algorithm and analysis do not rely on the Poisson or Pareto distribu- 
tion assumption on the behavior of the attacking streams. By modeling a 
non-attack stream as a sequence of Poisson processes with varying rates and 
over varying time periods, our analysis results can apply to almost any dis- 
tribution or pattern of usage of non-attack and attack streams. This model 
allows for substantial high-level correlation among non-attackers. 

Provable bounds: We give the first algorithm for detecting stepping-stone at- 
tacks that provides (a) provable bounds on the number of packets needed 
to confidently detect and identify stepping-stone streams, and (b) provable 
guarantees on false positive rates. Our bounds on the number of packets 
needed for confident detection are only quadratic in terms of certain natural 
parameters of the problem, which indicates the efficiency of our algorithm. 
Stronger results with chaff: We also propose detection algorithms and give 
a hardness result when the attacker inserts “chaff” traffic in the stepping- 
stone streams. Our analysis shows that our detection algorithm is effective 
when the attacker inserts chaff that is less than a certain threshold fraction. 
Our hardness results indicate that when the attacker can insert chaff that 
is more than a certain threshold fraction, the attacker can make the attack- 
ing streams mimic two independent random processes, and thus completely 
evade any detection algorithm. Note that our hardness analysis will apply 
even when the monitor can actively manipulate the timing delay. Our results 
on the chaff case are also a significant advance from previous work. The work 
of Donoho et al. [4] assumes that the chaff traffic inserted by the attacker 
is a Poisson process independent from the non-chaff traffic in the attacking 
stream, while our results make no assumption on the distribution of the chaff 
traffic. 

The type of guarantee we will be able to achieve is that given a confidence
parameter $\delta$, our procedure will certify a pair as attacking or non-attacking
with error probability at most $\delta$, after observing a number of packets that is
only quadratic in certain natural parameters of the problem and logarithmic
in $1/\delta$. Our approach is based on a connection to sample-complexity bounds
in Computational Learning Theory. In that setting, one has a set or sequence
of hypotheses $h_1, h_2, \ldots$, and the goal is to identify which if any of them has
a low true error rate from observing performance on random examples [6-8].
The type of question addressed in that literature is how much data does one
need to observe in order to ensure at most some given $\delta$ probability of failure.
In our setting, to some extent packets play the role of examples and pairs of 
streams play the role of hypotheses, though the analogy is not perfect because 




Detection of Interactive Stepping Stones: Algorithms and Confidence Bounds 261 



it is the relationship between packets that provides the information we use for 
stepping-stone detection. 

The high-level idea of our approach is that if we consider two packet streams 
and look at the difference between the number of packets sent on them, then 
this quantity is performing some type of random walk on the one-dimensional 
line. If these streams are part of a stepping-stone attack, then by the maximum- 
tolerable delay assumption, this quantity will never deviate too far from the 
origin. However, if the two streams are not part of an attack, then even if the 
streams are somewhat correlated, say because they are Poisson with rates that 
vary in tandem, this walk will begin to experience substantial deviation from the 
origin. There are several subtle issues: for example, our algorithm may not know 
in advance what an attacker’s tolerable delay is. In addition, new streams may 
be arriving over time, so if we want to be careful not to have false-positives, we 
need to adjust our confidence threshold as new streams enter the system. 

Outline. In the rest of the paper, we first discuss related work in Section 2, then 
give the problem definition in Section 3. We then describe the stepping-stone 
detection algorithm and confidence bounds analysis in Section 4. We consider 
the consequences of adding chaff in Section 5. We finally conclude in Section 6. 

2 Related Work 

The initial line of work in identifying interactive stepping stones focused on 
content-based techniques. The interactive stepping stone problem was first for- 
mulated and studied by Staniford and Heberlein [1]. They proposed a content- 
based algorithm that created thumbprints of streams and compared them, look-
ing for extremely good matches. Another content-based approach, Sleepy Water-
mark Tracing, was proposed by Wang et al. [10]. These content-based approaches
require that the content of the streams under consideration does not change signif-
icantly between the streams. Thus, for example, they do not apply to encrypted
traffic such as SSH sessions.

Another line of work studies correlation of streams based on connection tim- 
ings. Zhang and Paxson [2] proposed an algorithm for encrypted connection 
chains based on periods of activity of the connections. They observed that in 
stepping stones, the ON-periods and OFF-periods will coincide. They use this 
observation to detect stepping stones, by examining the number of consecutive 
OFF-periods and the distance of the OFF-periods. Yoda and Etoh [3] proposed 
a deviation-based algorithm to trace the connection chains of intruders. They 
computed deviations between a known intruder stream and all other concurrent 
streams on the Internet, compared the packets of streams which have small de-
viations from the intruder’s stream, and utilized these analyses to identify a set
of streams that match the intruder stream. Wang et al. [11] proposed another 
timing-based approach that uses the arrival and departure times of packets to 
correlate connections in real-time. They showed that the inter-packet timing 
characteristics are preserved across many router hops, and often uniquely iden- 
tify the correlations between connections. These algorithms based on connection 
timings, however, are all vulnerable to active timing perturbation by the attacker:
they will not be able to detect stepping stones when the attacker actively
perturbs the timings of the packets on the stepping-stone streams.

We are aware of only two papers [4, 5] that study the problem of detecting 
stepping-stone attacks on encrypted streams with the assumption of a bound on 
the maximum delay tolerated by the attacker. In Section 1, we discuss the work 
of Donoho et al. [4] in relation to our paper. We note that their work does not 
give any bounds on the number of packets needed to detect correlation between 
streams, or a discussion of the false positives that may be identified by their 
method. Wang and Reeves [5] proposed a watermark-based scheme, which can 
detect correlation between streams of encrypted packets. However, they assume 
that the attacker’s timing perturbation of packets is independent and identically 
distributed (iid). Our algorithms do not require such an assumption. Further, 
they need to actively manipulate the inter-packet delays in order to embed and 
detect their watermarks. In contrast, our algorithms require only passive moni- 
toring of the arrival times of the packets. 

Wang [12] examined the problem of determining the serial order of correlated 
connections in order to determine the intrusion path, when given the complete 
set of correlated connections. 

3 Problem Definition 

Our problem definition essentially mirrors that of Donoho et al. [4]. A stream is
a sequence of packets that belong to the same connection. We assume that the
attacker has a maximum delay tolerance $\Delta$, which we may or may not know.
That is, for every packet sent in the first stream, there must be a corresponding
packet in the second stream between 0 and $\Delta$ time steps later. The notion of
maximum delay bound was first proposed by Donoho et al. [4]. We also assume
that there is a maximum number of packets that the attacker can send in a
particular time interval $t$, which we call $p_t$. We note that $p_\Delta$ is unlikely to be
very large, since we are considering interactive stepping-stone attacks. As in prior
work, we assume that a packet on either stream maps to only one packet on the
other stream (i.e., packets are not combined or broken down in any manner).

Similar to previous work, we do not pay attention to the content or the sizes 
of the packets, since the packets may be encrypted. We assume that the real-
time traffic delay between packets is very small compared to $\Delta$, and ignore it
everywhere. We have a stepping-stone monitor that observes the streams going
through the monitor, and keeps track of the total number of packets on each
stream at each time of observation. We denote the total number of packets in
stream $i$ by time $t$ as $N_i(t)$, or simply $N_i$ if $t$ is the current time step.

By our assumptions, for a pair of stepping-stone streams $S_1$, $S_2$, the following
two conditions hold for the true packets of the streams, i.e., not including chaff
packets:

1. $N_1(t) \ge N_2(t)$.

Every packet in stream 2 comes from stream 1.

2. $N_1(t) \le N_2(t + \Delta)$.

All packets in stream 1 must go into stream 2, i.e., no packets on stream
1 are lost en route to stream 2, and all the packets on stream 1 arrive on
stream 2 within time $\Delta$.

If the attacker sends no chaff on his streams, then all the packets on a stepping 
stone pair will obey the above two conditions. 
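
As an illustration, the two conditions can be checked directly on recorded arrival times. The following is a minimal sketch, not part of the original algorithm: the function names `packets_by` and `obeys_conditions` are hypothetical, arrival times are assumed to be sorted, and the trace of stream 2 is assumed to extend at least $\Delta$ past the last packet of stream 1.

```python
from bisect import bisect_right


def packets_by(times, t):
    """N_i(t): number of packets with arrival time <= t (times must be sorted)."""
    return bisect_right(times, t)


def obeys_conditions(t1, t2, max_delay):
    """Check conditions 1 and 2 for sorted arrival times t1 (stream 1), t2 (stream 2).

    Chaff is assumed to be absent; max_delay plays the role of Delta.
    """
    # Condition 1, N_1(t) >= N_2(t), can only break when stream 2 gains a packet.
    for t in t2:
        if packets_by(t1, t) < packets_by(t2, t):
            return False
    # Condition 2, N_1(t) <= N_2(t + Delta), can only break when stream 1 gains a packet.
    for t in t1:
        if packets_by(t1, t) > packets_by(t2, t + max_delay):
            return False
    return True
```

Checking only at arrival instants is sufficient because both counts are non-decreasing step functions, so any violation must already be present at the most recent arrival on the relevant stream.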

We will find it useful to think about the number of packets in a stream in 
terms of the total number of the packets observed in the union of two streams: 
in other words, viewing each arrival of a packet in the union of the two streams 
as a “time step”. We will use $N_i(w)$ for the number of packets in stream $i$, when
there are a total of $w$ packets in the union of the two streams.

In Section 4.1, we assume that a normal stream $i$ is generated by a Poisson
process with a constant rate $\lambda_i$. In Section 4.2, we generalize this, allowing for
substantial high-level correlation between non-attacking streams. Specifically,
we model a non-attacking stream as a “Poisson process with a knob”, where
the knob controls the rate of the process and can be adjusted arbitrarily by
the user with time. That is, the stream is really generated by a sequence of
Poisson processes with varying rates for varying lengths of time. Even if two
non-attacking streams correlate by adjusting their knobs together, e.g., both
having a high rate at certain times and low rates at others, our procedure will
nonetheless (with high probability) not be fooled into falsely tagging them as an
attacking pair.

The guarantees produced by our algorithm will be described by two quantities:

- a monitoring time $M$, measured in terms of the total number of packets that
  need to be observed on both streams before deciding whether the pair of
  streams is an attack pair, and

- a false-positive probability $\delta$, given as input to the algorithm (also called
  the confidence level), that describes our willingness to falsely accuse a non-
  attacking pair.

The guarantees we will achieve are that (a) any stepping-stone pair will be
discovered after $M$ packets, and (b) any normal pair has at most a $\delta$ chance of
being falsely accused. Our algorithm will never fail to flag a true attacking pair,
so long as at least $M$ packets are observed. For instance, our first result, Theorem
1, is that if non-attacking streams are Poisson, then $M = \frac{8 p_\Delta^2}{\pi} \log \frac{1}{\delta}$ packets are
sufficient to detect a stepping-stone attack with false-positive probability $\delta$. One
can also adjust the confidence level with the number of pairs of streams being
monitored, to ensure at most a $\delta$ chance of ever falsely accusing a normal pair.

All logarithms in this paper are base 2. Table 1 summarizes the notation we
use in this paper.

Table 1. Summary of notation

$\Delta$      maximum tolerable delay bound
$p_\Delta$    maximum number of packets that may be sent in time interval $\Delta$
$\delta$      false positive probability
$S_i$         stream $i$
$M$           number of packets that we need to observe on the union of the two
              streams in the detection algorithms
$N_i(t)$      number of packets sent on stream $i$ in time interval $t$
$N_i(w)$      number of packets sent on stream $i$ when a total of $w$ packets is
              present on the union of the pair of streams under consideration



4 Main Results: Detection Algorithms 
and Confidence Bounds Analysis 

In this section, we give an algorithm that will detect stepping stones with a low 
probability of false positives. We only consider streams that have no chaff, which 
means that every packet on the second stream comes from the first stream, and 
packets can only be delayed, not dropped. We will discuss the consequences of 
adding chaff in Section 5. 

Our guarantees give a bound on the number of packets that need to be
observed to confidently identify an attacker. These bounds have a quadratic de-
pendence on the maximum tolerable delay $\Delta$ (or more precisely, on the number
of packets $p_\Delta$ an attacker can send in that time frame), and a logarithmic de-
pendence on $1/\delta$, where $\delta$ is the desired false-positive probability. The quadratic
dependence on maximum tolerable delay comes essentially from the fact that
on average it takes $O(p^2)$ steps for a random walk to reach distance $p$ from the
origin. Our basic bounds assume the value of $p_\Delta$ is given to the algorithm (The-
orems 1 and 2); we then show how to remove this assumption, increasing the
monitoring time by only an $O(\log \log p_\Delta)$ factor (Theorem 3).

We begin in Section 4.1 by considering a simple model of normal streams:
we assume that any normal stream $S_i$ can be modeled as a Poisson process, with
a fixed Poisson rate $\lambda_i$. We then generalize this model in Section 4.2. We make
no additional assumptions on the attacking streams.

4.1 A Simple Poisson Model 

We first describe our detection algorithm and analysis for the case that $p_\Delta$ is
known, and then later show how this assumption can be removed.

The Detection Algorithm. Our algorithm is simple and efficient: for a given
pair of streams, the monitor watches the packet arrivals, and counts packets
on both streams until the total number of packets (on both streams) reaches a
certain threshold $\frac{8 p_\Delta^2}{\pi}$.^1 The monitor then computes the difference in the number
of packets of the two streams: if the difference exceeds the packet bound $p_\Delta$,
we know the streams are normal; otherwise, it restarts. If the difference stays
bounded for a sufficiently long time ($\log \frac{1}{\delta}$ such trials of $\frac{8 p_\Delta^2}{\pi}$ packets), the monitor
declares that the pair of streams is a stepping stone. The algorithm is shown in
Fig. 1.

^1 The intuition for the parameters as well as the proof of correctness is in the analysis
section.

Detect-Attacks($\delta$, $p_\Delta$)
    Set $m = \log \frac{1}{\delta}$, $n = \frac{8 p_\Delta^2}{\pi}$.
    For $m$ iterations
        Observe $n$ packets on $S_1 \cup S_2$.
        Compute $d = N_1 - N_2$. If $d > p_\Delta$ return Normal.
    return Attack.

Fig. 1. Algorithm for stepping-stone detection (without chaff) with a simple Poisson
model

We note that the algorithm is memory-efficient - we only need to keep track 
of the number of packets seen on each stream. We also note that the algorithm 
does not need to know or compute the Poisson rates; it simply needs to observe 
the packets coming in on the streams. 
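
The procedure in Fig. 1 can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function name `detect_attacks` is hypothetical, the input is assumed to be the merged arrival sequence with each packet labeled 1 or 2 by stream, $m$ and $n$ are rounded up to integers, and an "Undecided" result is added for traces that end before enough packets are seen. Fig. 1 tests only $d > p_\Delta$; the sketch also treats $d < 0$ as evidence of a normal pair, which is an assumption based on the condition $N_1(t) \ge N_2(t)$ from Section 3.

```python
import math
from itertools import islice


def detect_attacks(arrivals, delta, p_delta):
    """Classify a pair of streams from merged arrival labels (1 or 2).

    delta   : allowed false-positive probability for this pair
    p_delta : assumed bound on N1 - N2 for a stepping-stone pair
    """
    m = math.ceil(math.log2(1 / delta))        # number of intervals
    n = math.ceil(8 * p_delta ** 2 / math.pi)  # packets per interval
    it = iter(arrivals)
    n1 = n2 = 0                                # only two counters are kept
    for _ in range(m):
        batch = list(islice(it, n))
        if len(batch) < n:
            return "Undecided"                 # trace ended before M packets
        ones = batch.count(1)
        n1 += ones
        n2 += n - ones
        d = n1 - n2
        if d < 0 or d > p_delta:
            return "Normal"
    return "Attack"
```

For example, with `delta = 0.01` and `p_delta = 2` the sketch uses 7 intervals of 11 packets, so a perfectly alternating trace `[1, 2] * 40` is reported as an attack, while a trace carrying only stream-1 packets is cleared after the first interval.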



Analysis. We first note that, by design, our algorithm will always identify a
stepping-stone pair, provided the streams carry $M$ packets. We then show that the false
positive rate of $\delta$ is also achieved by the algorithm. Under the assumption that
normal streams may be modeled as Poisson processes, we show three analytical
results in the following analysis:

1. When $p_\Delta$ is known, the monitor needs to observe no more than
   $M = \frac{8 p_\Delta^2}{\pi} \log \frac{1}{\delta}$ packets on the union of the two streams under consideration, to
   guarantee a false positive rate of $\delta$ for any given pair of streams (Theorem 1).

2. Suppose instead that we wish to achieve a $\delta$ probability of false positive over
   all pairs of streams that we examine. For instance, we may wish to achieve
   a false positive rate of $\delta$ over an entire day of observations, rather than over
   a particular number of streams. When $p_\Delta$ is known, the monitor needs to
   observe no more than $M = \frac{8 p_\Delta^2}{\pi} \log \frac{i(i+1)}{\delta}$ packets on the union of the $i$th
   pair of streams, to guarantee a $\delta$ chance of false positive among all pairs of
   streams it examines (Theorem 2).

3. When $p_\Delta$ is unknown, we can achieve the above guarantees with only an
   $O(\log \log p_\Delta)$ factor increase in the number of packets that need
   to be observed (Theorem 3).

Below, we first give some intuition and then the detailed theorem statements
and analysis.

Intuition. We first give some intuition behind the analysis. Consider two normal
streams as Poisson processes with rates $\lambda_1$ and $\lambda_2$. We can treat the difference
between two Poisson processes as a random walk, as shown in Fig. 2. Consider a
sequence of packets generated in the union of the two streams. The probability
that a particular packet is generated by the first stream is $\frac{\lambda_1}{\lambda_1 + \lambda_2}$ (which we
denote $\mu_1$), and the probability that it is generated by the second stream is $\frac{\lambda_2}{\lambda_1 + \lambda_2}$
(which we call $\mu_2$). We can define a random variable $Z$ to be the difference
between the number of packets generated by the streams. Every time a packet is
sent on either $S_1$ or $S_2$, $Z$ increases by 1 with probability $\mu_1$, and decreases by
1 with probability $\mu_2$. It is therefore a one-dimensional random walk. We care
about the expected time for $Z$ to exit the bounded region $[0, p_\Delta]$, given that
it begins at some arbitrary point inside this range. If $Z < 0$, then the second
stream has definitely sent a packet that the first stream did not; if $Z > p_\Delta$, then the
delay bound is violated.



Fig. 2. (a) Packets arriving in the two streams, (b) Viewing the arrival of packets as a
random walk with rates $\lambda_1$ and $\lambda_2$



Theorem 1. Under the assumption that normal streams behave as Poisson pro-
cesses, the algorithm Detect-Attacks will correctly detect stepping-stone at-
tacks with a false positive probability at most $\delta$ for any given pair of streams,
after monitoring $\frac{8 p_\Delta^2}{\pi} \log \frac{1}{\delta}$ packets on the union of the two streams.

Proof. Let $0 \le N_1(w) - N_2(w) \le p_\Delta$ at total packet count $w$. Then, after $n$
further packet arrivals, we want to bound the probability that the difference is
still within $[0, p_\Delta]$. Let $Z = N_1(w + n) - N_2(w + n)$. For any given $x$, we have:

\[ \Pr[Z = x] = \binom{n}{\frac{n+x}{2}} \mu_1^{\frac{n+x}{2}} \mu_2^{\frac{n-x}{2}}. \]

Using Stirling’s approximation, for $0 \le x \le p_\Delta \ll n$,

\[ \Pr[Z = x] \le \frac{1}{\sqrt{\pi n / 2}}. \]

Therefore, over the interval of length $p_\Delta$,

\[ \Pr[0 \le Z \le p_\Delta] \le \sum_{x=0}^{p_\Delta} \Pr[Z = x] \le \frac{p_\Delta}{\sqrt{\pi n / 2}}. \]

Substituting $n = \frac{8 p_\Delta^2}{\pi}$, we get $\Pr[0 \le Z \le p_\Delta] \le \frac{1}{2}$.

To ensure that this is bounded by the given confidence level, we take $m$ such
observations of $n$ time steps, so that $(\frac{1}{2})^m \le \delta$, or

\[ m \ge \log \frac{1}{\delta}. \]

We need to observe $m$ sets of $n$ packets; therefore, we need $\log \frac{1}{\delta}$ intervals
of $\frac{8 p_\Delta^2}{\pi}$ packets each. □
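
The per-interval bound used in the proof can be checked numerically for small parameters. The sketch below is an illustration only: the helper names `interval_length` and `stay_probability` are hypothetical, and the walk is assumed to take independent +1/-1 steps with probabilities $\mu_1$ and $1 - \mu_1$, as in the intuition paragraph. It computes the exact binomial probability that the walk is still inside $[0, p_\Delta]$ after one interval, so it can be compared against the value $\frac{1}{2}$ obtained from Stirling's approximation.

```python
import math


def interval_length(p_delta):
    """Packets per interval, n = 8 * p_delta^2 / pi, rounded up."""
    return math.ceil(8 * p_delta ** 2 / math.pi)


def stay_probability(n, p_delta, start, mu1=0.5):
    """Exact Pr[0 <= Z <= p_delta] after n steps of a +1/-1 walk from `start`.

    Each step is +1 with probability mu1 (packet on stream 1)
    and -1 with probability 1 - mu1 (packet on stream 2).
    """
    total = 0.0
    for ups in range(n + 1):
        z = start + ups - (n - ups)  # position after `ups` up-steps
        if 0 <= z <= p_delta:
            total += math.comb(n, ups) * mu1 ** ups * (1 - mu1) ** (n - ups)
    return total
```

For $p_\Delta = 5$ the interval length is 64 packets, and the exact probability of remaining inside the interval stays below one half for every starting point in $[0, 5]$, which is what the repeated-trials argument needs.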

We have just shown in Theorem 1 that our algorithm in Fig. 1 will identify
any given stepping-stone pair correctly, and will have a probability $\delta$ of a false
positive for any given non-attacking pair of streams. We can also modify our
algorithm so that it only has a probability $\delta$ of a false positive among all the
pairs of streams that we observe. That is, given $\delta$, we distribute it over all the
pairs of streams that we can observe, by allowing only probability $\frac{\delta}{i(i+1)}$ of false
positive for the $i$th pair of streams, and using the fact that $\sum_{i=1}^{\infty} \frac{1}{i(i+1)} = 1$.
To see why this might be useful, suppose $\delta = 0.001$. Then, we would expect to
falsely accuse one pair out of every 1000 pairs of (normal) streams. It could be
more useful at times to be able to give a false positive rate of 0.001 over an entire
month of observations, rather than give that rate over a particular number of
streams.

Theorem 2. Under the assumption that normal streams behave as Poisson pro-
cesses, the algorithm Detect-Attacks will have a probability at most $\delta$ of a
false positive among all the pairs of streams it examines if, for the $i$th pair of
streams, it uses a monitoring time of $\frac{8 p_\Delta^2}{\pi} \log \frac{i(i+1)}{\delta}$ packets.

Proof. We need to split our allowed false positives $\delta$ among all the pairs we will
observe; however, since we do not know the number of pairs in advance, we do
not split the $\delta$ evenly.

Instead, we allow the $i$th pair of streams a false positive probability of $\frac{\delta}{i(i+1)}$,
and then use the previous algorithm with the updated false positive level. The
result then follows from Theorem 1 and the fact that $\sum_{i=1}^{\infty} \frac{\delta}{i(i+1)} = \delta$. □
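
The uneven split of the false-positive budget can be made concrete with exact arithmetic. The following sketch is illustrative and uses hypothetical helper names (`pair_budget`, `total_budget`, `monitoring_packets`); it assumes pairs are numbered from 1 and that the interval count and interval length are rounded up to integers. Because $\frac{1}{i(i+1)} = \frac{1}{i} - \frac{1}{i+1}$, the budget spent on the first $k$ pairs telescopes to $\delta (1 - \frac{1}{k+1})$, which never exceeds $\delta$.

```python
import math
from fractions import Fraction


def pair_budget(delta, i):
    """False-positive budget delta / (i * (i + 1)) for the i-th pair, i >= 1."""
    return Fraction(delta) / (i * (i + 1))


def total_budget(delta, num_pairs):
    """Budget spent on pairs 1..num_pairs; always strictly less than delta."""
    return sum(pair_budget(delta, i) for i in range(1, num_pairs + 1))


def monitoring_packets(p_delta, delta, i):
    """Packets to observe on the i-th pair: intervals times packets per interval."""
    m = math.ceil(math.log2(float(1 / pair_budget(delta, i))))
    n = math.ceil(8 * p_delta ** 2 / math.pi)
    return m * n
```

The cost of this scheme is mild: the monitoring time for the $i$th pair grows only logarithmically in $i$.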

The arguments so far assume that the algorithm knows the quantity $p_\Delta$.
We now remove this assumption by using a “guess and double” strategy. Let
$p_j = 2^j$. When a pair of streams is “cleared” as not being a stepping-stone
attack with respect to $p_j$, we then consider it with respect to $p_{j+1}$. By setting
the error parameters appropriately, we can maintain the guarantee that any
normal pair is falsely accused with probability at most $\delta$, while guaranteeing
that any attacking pair will be discovered with a monitoring time that depends
only on the actual value of $p_\Delta$. Thus, we can still obtain strong guarantees. In
addition, even though this algorithm “never” finishes monitoring a normal pair
of streams, the time between steps at which the monitor compares the difference
$N_1 - N_2$ increases over the sequence. This means that for the streams that have
been under consideration for a long period of time, the monitor tests differences
less often, and thus does not need to do substantial work, so long as the stream
counters are running continuously.
Theorem 3. Assume that normal streams behave as Poisson processes. Then,
even if $p_\Delta$ is unknown, we can use algorithm Detect-Attacks as a subroutine
and have a false positive probability at most $\delta$, while correctly catching stepping-
stone attacks within $O(p_\Delta^2 (\log \log p_\Delta + \log \frac{1}{\delta}))$ packets, where $p_\Delta$ is the actual
maximum value of $N_1(t) - N_2(t)$ for the attacker.

Proof. As discussed above, we run Detect-Attacks using a sequence of “$p_\Delta$”
values $p_j$, where $p_j = 2^j$, incrementing $j$ when the algorithm returns Normal.
As in Theorem 2, we use $\frac{\delta}{j(j+1)}$ as our false-positive probability on iteration $j$,
which guarantees having at most a $\delta$ false-positive probability overall. We now
need to calculate the monitoring time. For a given attacking pair, the number
of packets needed to catch it is at most:

\[ \sum_{j=1}^{\lceil \log p_\Delta \rceil} \frac{8}{\pi} (2^j)^2 \log \frac{j(j+1)}{\delta}. \]

Since the entries in the summation are more than doubling with $j$, the sum is
at most twice the value of its largest term, and so the total monitoring time is
$O(p_\Delta^2 (\log \log p_\Delta + \log \frac{1}{\delta}))$. □
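
The guess-and-double strategy can be sketched as a loop around the basic test. This is a hedged illustration, not the authors' code: the function name `detect_unknown_bound` is hypothetical, the input is assumed to be the merged arrival sequence labeled 1 or 2 by stream, the counters are assumed to run continuously across guesses, and "Undecided" is returned when the trace ends (a normal pair is otherwise monitored indefinitely). As in the earlier discussion, a negative difference is also treated as clearing the pair, which is an assumption beyond the one-sided test of Fig. 1.

```python
import math
from itertools import islice


def detect_unknown_bound(arrivals, delta):
    """Guess-and-double detection; returns (verdict, last guess p_j)."""
    it = iter(arrivals)
    n1 = n2 = 0
    j = 1
    while True:
        p_j = 2 ** j                               # current guess for p_delta
        delta_j = delta / (j * (j + 1))            # budget for this guess
        m = math.ceil(math.log2(1 / delta_j))
        n = math.ceil(8 * p_j ** 2 / math.pi)
        cleared = False
        for _ in range(m):
            batch = list(islice(it, n))
            if len(batch) < n:
                return ("Undecided", p_j)          # trace ended
            ones = batch.count(1)
            n1 += ones
            n2 += n - ones
            if not 0 <= n1 - n2 <= p_j:
                cleared = True                     # "Normal" with respect to p_j
                break
        if not cleared:
            return ("Attack", p_j)
        j += 1
```

A relay that forwards each packet immediately is caught with the first guess, while a relay that forwards packets in bursts of three is cleared for $p_1 = 2$ and then caught for $p_2 = 4$.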

4.2 Generalizing the Poisson Model 

We now relax the assumption that a normal process is Poisson with a fixed rate
$\lambda$. Instead, we assume that a normal process can be modeled as a sequence of
Poisson processes, with varying rates, and over varying time periods. From the 
point of view of our algorithm, one can view this as a Poisson process with a 
user-adjustable “knob” that is being controlled by an adversary to fool us into 
making a false accusation. 

Note that this is a general model; we could use it to coarsely approximate 
almost any distribution, or pattern of usage. For example, at a high level, this 
model could approximately simulate Pareto distributions which are thought to 
be a good model for users’ typing patterns [13], by using a Pareto distribution 
to choose our Poisson rates for varying time periods, which could be arbitrarily 
small. Correlated users can be modeled as having the same sequence of Poisson 
rates and time intervals: for example, co-workers may work together and take 
short or long breaks together. 

Formally, for a given pair of streams, we will assume the first stream is a
sequence given by $(\lambda_{11}, t_{11}), (\lambda_{12}, t_{12}), \ldots$, and the second stream by $(\lambda_{21}, t_{21}),
(\lambda_{22}, t_{22}), \ldots$. Let $N_i(t)$ denote the number of packets sent in stream $i$ by time
$t$. Then, the key to the argument is that over any given time interval $T$, the
number of packets sent by stream $i$ is distributed according to a Poisson process
with a single rate $\lambda_{i,T}$, which is the weighted mean of the rates of all the Poisson
processes during that time. That is, if time interval $T$ contains a sequence of
time intervals $j_{start}, \ldots, j_{end}$, then $\lambda_{i,T} = \frac{1}{T} \sum_{j=j_{start}}^{j_{end}} \lambda_{ij} t_{ij}$ (breaking intervals
if necessary to match the boundaries of $T$).
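
The weighted mean rate over an interval can be computed by clipping each rate segment to the interval. The sketch below is an illustration under stated assumptions: the function name `mean_rate` is hypothetical, the stream is given as (rate, duration) pairs laid end to end starting at time 0, and any part of the interval not covered by a segment is treated as having rate 0.

```python
def mean_rate(segments, start, end):
    """Weighted mean Poisson rate of a piecewise-constant stream over [start, end].

    segments : list of (rate, duration) pairs, laid end to end from time 0
    """
    if end <= start:
        raise ValueError("interval must have positive length")
    weighted = 0.0
    seg_start = 0.0
    for rate, duration in segments:
        seg_end = seg_start + duration
        # Length of the part of this segment that falls inside [start, end].
        overlap = min(end, seg_end) - max(start, seg_start)
        if overlap > 0:
            weighted += rate * overlap
        seg_start = seg_end
    return weighted / (end - start)
```

For a stream that runs at rate 2 for 10 time units and then at rate 6 for 10 time units, the interval from 5 to 15 overlaps each segment for 5 units, giving a mean rate of 4.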
Theorem 4. Assuming that normal streams behave as sequences of Poisson pro-
cesses, the algorithm Detect-Attacks will have a false positive rate of at most
5, if it observes at least ^ log intervals of n packets each, where n= ■ 



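Proof. Let $S(t)$ be the number of packets on the union of the streams at time $t$.
Let $D(t)$ be the difference in the number of packets at time $t$, i.e., $N_1(t) - N_2(t)$.
Let $h = \frac{8 p_\Delta^2}{\pi}$, and let $I$ denote the interval $[0, p_\Delta]$.

We define $T$ to be the time when $\Pr[S(T) \ge h] = \frac{1}{2}$, and let $T' \ge T$. Then,

\begin{align*}
\Pr[D(T') \notin I] &= \Pr[D(T') \notin I \mid S(T') \ge h] \Pr[S(T') \ge h] \\
&\quad + \Pr[D(T') \notin I \mid S(T') < h] \Pr[S(T') < h] \\
&\ge \Pr[D(T') \notin I \mid S(T') \ge h] \Pr[S(T') \ge h] \\
&\ge \frac{1}{2} \Pr[D(T') \notin I \mid S(T') \ge h] \\
&= \frac{1}{2} \left(1 - \Pr[D(T') \in I \mid S(T') \ge h]\right).
\end{align*}

From the proof of Theorem 1, we know

\[ \Pr[D(T') \in I \mid S(T') \ge h] \le \frac{p_\Delta}{\sqrt{\pi h / 2}}. \]

Therefore, $\Pr[D(T') \notin I] \ge \frac{1}{2} \left(1 - \frac{p_\Delta}{\sqrt{\pi h / 2}}\right)$.

Substituting $h = \frac{8 p_\Delta^2}{\pi}$, we get $\Pr[D(T') \notin I] \ge \frac{1}{4}$.

Now, note that $\Pr[S(T) \ge kh] \le \frac{1}{k}$. Therefore, $\Pr[t < T \mid S(t) \ge kh] \le \frac{1}{k}$.
Then,

\begin{align*}
\Pr[D(t) \notin I \mid S(t) \ge kh] &= \Pr[D(t) \notin I \mid t \ge T] \Pr[t \ge T \mid S(t) \ge kh] \\
&\quad + \Pr[D(t) \notin I \mid t < T] \Pr[t < T \mid S(t) \ge kh] \\
&\ge \Pr[D(t) \notin I \mid t \ge T] \Pr[t \ge T \mid S(t) \ge kh] \\
&\ge \frac{1}{4} \left(1 - \frac{1}{k}\right).
\end{align*}

Therefore, $\Pr[D(t) \in I \mid S(t) \ge kh] \le 1 - \frac{1}{4} \left(1 - \frac{1}{k}\right)$.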

To bound this by the given confidence level, we need to take m such obser- 
vations of kh packets in the union of the streams, so that: 



Since 




Setting k = 2, 





< S. 



m > 



log? 

log(i)- 



< |, we set TO > |log f 



□ 
Likewise, we have the analogues of Theorem 2 and Theorem 3 for the general
model. We omit their proofs, since they are very similar to the proofs of Theo-
rem 2 and Theorem 3.

Theorem 5. Assuming that normal streams behave as sequences of Poisson pro-
cesses, the algorithm Detect-Attacks will have a probability at most $\delta$ of a
false positive over all pairs of streams it examines, if, for the $i$th pair of streams,



it observes | log 






intervals of $n$ packets each, where $n = \frac{16 p_\Delta^2}{\pi}$.



Theorem 6. Assuming that normal streams behave as sequences of Poisson pro-
cesses, then if $p_\Delta$ is unknown, we can use repeated doubling and incur an extra
$O(\log \log p_\Delta)$ factor in the number of packets over that in Theorem 5, to achieve
false-positive probability $\delta$.



5 Chaff: Detection and Hardness Result

All the results in Section 4 rely on the attacker streams obeying two assumptions
in Section 3: in a pair of attacker streams, every packet sent on the first stream
arrives on the second stream, and any packet that arrives on the second stream 
arrives from the first stream. In this section, we examine the consequences of 
relaxing these assumptions. 

Notice that only the packets that must reach the target need to obey these 
two assumptions. However, the attacker could insert some superfluous packets 
into either of the two streams, that do not need to reach the target, and therefore, 
do not have to obey the assumptions. Such extraneous packets are called chaff.
By introducing chaff into the streams, the attacker would try to ensure that the
numbers of packets observed in his two streams appear less correlated, and thus
reduce the chances of being detected.

Donoho et al. [4] also examine the consequences of the addition of chaff to 
attack streams. They show that under the assumption that the chaff in the 
streams is generated by a Poisson process that is independent of the non-chaff 
packets in the stepping-stone streams, it is possible to detect correlation between 
stepping-stone pairs, as long as the streams have sufficient packets. However, an 
attacker may not wish to generate chaff as a Poisson process. In this section, 
we assume that a clever attacker will want to optimize his use of chaff, instead 
of adding it randomly to the streams. In Section 5.1 we explain how to detect 
stepping stones using our algorithm when the attacker uses a limited amount of 
chaff (Theorem 7). In Section 5.2 we describe how an attacker could use chaff to 
make a pair of stepping-stone streams mimic two independent Poisson processes, 
and thus ensure that the pair of streams are not correlated. We then give upper 
bounds on the minimum chaff the attacker needs to do this (Theorems 8 and 9). 

5.1 Algorithm for Detection with Chaff 

Recall that our algorithm Detect- Attacks is based on the observation that, 
with high probability, two independent Poisson processes will differ by any fixed 




distance given sufficient time. An attacker can, therefore, evade detection with 
our algorithm by introducing a sufficient difference between the streams all the 
time. Specifically, our algorithm checks if the two streams have a difference that
is greater than $p_\Delta$ packets every time there are $\frac{8 p_\Delta^2}{\pi}$ packets in the union of the
streams. To evade our algorithm as it stands (in Fig. 1), all that the attacker
might need to do is to send one packet of chaff on the faster stream.

Algorithm. We now modify Detect-Attacks slightly, to detect stepping-
stone attacks under a limited amount of chaff. Instead of waiting for a difference
of $p_\Delta$ packets between the two streams, we could wait for a difference of $2 p_\Delta$
packets. The independent Poisson processes would eventually get a difference
of $2 p_\Delta$, but now, the attacker would need to send at least $p_\Delta$ packets in chaff
in order to evade detection. He could get away with exactly $p_\Delta$ packets if he
sends all of the chaff packets in the same time interval, on the same stream.
However, as long as he sends less than $p_\Delta$ packets of chaff in every time interval,
the monitor will flag his streams as stepping stones.^2 The complete algorithm is
shown in Fig. 3.



Detect-Attacks-Chaff ($\delta$, $p_\Delta$)

Set $m = \log\frac{1}{\delta}$, $n = \frac{32p_\Delta^2}{\pi}$.

For $m$ iterations

  Observe $n$ packets on $S_1 \cup S_2$.

  Compute $d = N_1 - N_2$. If $d > 2p_\Delta$ return Normal.

return Attack.



Fig. 3. Algorithm for stepping-stone detection with less than $p_\Delta$ packets of chaff
every $32p_\Delta^2/\pi$ packets
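The procedure in Fig. 3 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes the interval length $n = 32p_\Delta^2/\pi$ (rounded up), a base-2 logarithm for $m$, cumulative packet counts, and that stream 1 is the upstream stream. The function name, the encoding of arrivals as the integers 1 and 2, and the "Undecided" result for a trace that ends early are all hypothetical.

```python
import math
from typing import Iterable


def detect_attacks_chaff(packets: Iterable[int], delta: float, p_delta: int) -> str:
    """Classify a pair of streams from the arrival order of packets on S1 U S2.

    `packets` yields 1 or 2, naming the stream each packet arrived on.
    """
    m = math.ceil(math.log2(1.0 / delta))        # number of intervals (base-2 log assumed)
    n = math.ceil(32 * p_delta ** 2 / math.pi)   # packets per interval (assumed value)
    arrivals = iter(packets)
    n1 = n2 = 0                                  # cumulative packet counts N1 and N2
    for _ in range(m):
        for _ in range(n):
            stream = next(arrivals, None)
            if stream is None:
                return "Undecided"               # hypothetical: trace ended too early
            if stream == 1:
                n1 += 1
            else:
                n2 += 1
        if n1 - n2 > 2 * p_delta:                # difference exceeds the chaff-tolerant bound
            return "Normal"
    return "Attack"


if __name__ == "__main__":
    print(detect_attacks_chaff([1, 2] * 100, delta=0.25, p_delta=2))
    print(detect_attacks_chaff([1] * 200, delta=0.25, p_delta=2))
```

The first call feeds a tightly coupled pair (arrivals alternate between the streams), the second a pair whose counts diverge immediately.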



Analysis. We now show that Detect-Attacks-Chaff will correctly identify
stepping stones with chaff, as long as the attacker sends no more than $p_\Delta$ packets
of chaff for every $32p_\Delta^2/\pi$ packets. Further, any given non-attacking pair of streams
will have no more than a $\delta$ chance of being called a stepping stone.

Theorem 7. Under the assumption that normal streams behave as Poisson processes,
and the attacker sends less than $p_\Delta$ packets of chaff every $32p_\Delta^2/\pi$ packets,
the algorithm Detect-Attacks-Chaff will have a false positive rate of at most
$\delta$, if we observe $\log\frac{1}{\delta}$ intervals of $32p_\Delta^2/\pi$ packets each.

Proof. Let $0 \le N_1(w) - N_2(w) \le p_\Delta$ at total packet count $w$. Then, after $n$
further packet arrivals, we want to bound the probability that the difference is
still within $[0, p_\Delta]$, for a normal pair. Let $Z = N_1(w+n) - N_2(w+n)$.

* We choose to wait for a difference of $2p_\Delta$ packets here, because it is the integral
multiple of $p_\Delta$ that maximizes the rate at which the attacker may send chaff. We
could replace it with the non-integral multiple of $p_\Delta$ that maximizes the rate at
which the attacker must send chaff, but we omit the details here.
As in the proof of Theorem 1, for any $x$, we have:

$$\Pr[Z = x] \le \sqrt{\frac{2}{\pi n}}$$

Therefore, over an interval of size $2p_\Delta$, we have:

$$\Pr[0 \le Z \le 2p_\Delta] \le 2p_\Delta \sqrt{\frac{2}{\pi n}}$$

Substituting $n = \frac{32p_\Delta^2}{\pi}$, we get $\Pr[0 \le Z \le 2p_\Delta] \le \frac{1}{2}$.

To ensure that this is bounded by the given confidence level, we take $m$ such
observations of $n$ time steps, so that $(\frac{1}{2})^m \le \delta$, which gives $m \ge \log\frac{1}{\delta}$.

Therefore, a normal pair will differ by at least $2p_\Delta$ with probability at least
$1 - \delta$, in $\log\frac{1}{\delta}$ intervals of $n$ packets.

On the other hand, for an attack pair with no chaff, we know that
$N_1(w) - N_2(w) \le p_\Delta$. When the attacker can add less than $p_\Delta$ packets of chaff in $n$
packets, $N_1(w+n) - N_2(w+n) \le 2p_\Delta$, and thus the difference in packet count of an
attack pair cannot exceed $2p_\Delta$ in $n$ packets. □
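The key step of the proof, that a normal pair stays within a window of width $2p_\Delta$ with probability at most one half over one interval, can be checked numerically. The sketch below computes the exact probability for a fair random walk that starts from a zero difference. The interval length $n = 32p_\Delta^2/\pi$ (rounded up) and both function names are assumptions made for this illustration.

```python
import math


def prob_within(n: int, lo: int, hi: int) -> float:
    """Exact Pr[lo <= Z <= hi], where Z = (#heads - #tails) after n fair coin flips."""
    favourable = 0
    for heads in range(n + 1):
        z = 2 * heads - n                 # difference for this number of heads
        if lo <= z <= hi:
            favourable += math.comb(n, heads)
    return favourable / 2 ** n            # exact big-integer ratio, then converted to float


def interval_length(p_delta: int) -> int:
    """Packets per interval, assuming n = 32 * p_delta^2 / pi rounded up."""
    return math.ceil(32 * p_delta ** 2 / math.pi)


if __name__ == "__main__":
    for p_delta in (1, 5, 20):
        n = interval_length(p_delta)
        print(p_delta, n, prob_within(n, 0, 2 * p_delta))
```

With a per-interval miss probability of at most one half, $m = \lceil \log_2(1/\delta) \rceil$ intervals bring the false positive rate below $\delta$. For example, $\delta = 0.01$ needs 7 intervals.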



Note that Theorem 7 is the analogue of Theorem 1 when the chaff rate is 
bounded as described above. The analogues to the other theorems in Section 4 
can be obtained in a similar manner. 

Obviously, the attacker can evade detection by sending more than $p_\Delta$ packets
of chaff for every $32p_\Delta^2/\pi$ packets. Further, if we count in pre-specified intervals,
the attacker would only need to send $p_\Delta$ packets of chaff in one of the intervals,
since the algorithm only checks if the streams differ by the specified bound in
any of the intervals.

We could address the second problem by sampling random intervals, and
checking if the difference $Z$ in those intervals is at least $2p_\Delta$. We could also
modify our algorithm to check if the difference $Z$ stays outside $2p_\Delta$ for at least
a fourth of the intervals, and analyze the resulting probabilities with Chernoff
bounds. To defeat this, the attacker would have to send at least a $\frac{\pi}{32p_\Delta}$ fraction
of the total packets on the union ($p_\Delta$ packets of chaff every $32p_\Delta^2/\pi$ packets) in an
independent interval, so that every (sufficiently long) interval is unsuspicious.

However, if the attacker just chooses to send a lot of chaff packets on his 
stepping-stone streams, then he will be able to evade the algorithm we proposed. 
This type of evasion is, to some extent, inherent in the problem, not just the 
detection strategy we propose. In the next section, we show how an attacker could 
successfully mimic two independent streams, so that no algorithm could detect 
the attacker. We also give upper bounds on the minimum chaff the attacker 
needs to add to his streams, so that his attack streams are completely masked 
as independent processes. 
5.2 Hardness Result for Detection with Chaff

If an attacker is able to send a lot of chaff, he can in effect ride his communication 
on the backs of two truly independent Poisson processes. In this section, we 
analyze how much chaff this would require. This gives limitations on what we 
could hope to detect if we do not make additional assumptions on the attacker. 

Specifically, in order to simulate two independent Poisson processes exactly, 
the attacker could first generate two independent Poisson processes, and then 
send packets on his streams to match them. He needs to send chaff packets 
on one of the streams, when the constraints on the other stream do not allow 
the non-chaff packet to be forwarded to/from it. In this way, he can mimic 
the processes exactly, and the pair of streams will not appear to be a
stepping-stone pair to any monitor watching it. Note that even if the inter-packet delays
were actively manipulated by the monitor, the attacker can still mimic two 
independent Poisson processes, and therefore, by our definition, will be able 
to evade detection. 

Let $\lambda_1$ be the rate of the first Poisson process, and $\lambda_2$ be the rate of the second
Poisson process. In our analysis, we assume $\lambda_1 = \lambda_2 = \lambda$. If $\lambda_1 \ll \lambda_2$ or
$\lambda_1 \gg \lambda_2$, the attacker will need to send many more chaff packets on the faster
stream, so $\lambda_1 = \lambda_2$ will be the best choice for the attacker.

We model the Poisson processes as binomials. We choose to approximate
the two independent Poisson processes of rate $\lambda$ as two independent binomial
processes, for cleaner analysis. To generate these processes, we assume that the
attacker flips two coins, each with $\lambda$ bias (of getting a head), at each time step.†
He has to send a packet (either a real packet or chaff) on a stream when its
corresponding coin turns up heads, and should send nothing when the coin turns
up tails. That way, he ensures that the two streams model two independent
binomial processes exactly. Since the attacker generates the independent
binomial processes, he could flip coins $\Delta$ or more time steps ahead, and then decide
whether a non-chaff packet can be sent across for a particular coin flip that obeys
all constraints, or if it has to be chaff.

We now show how the attacker could simulate two independently generated
binomial processes with minimum chaff. First, the attacker generates two
sequences of independent coin flips. The following algorithm, Bounded-Greedy-Match,
then produces a strategy that minimizes chaff for the attacker, for any
pair of sequences of coin flips. Given two sequences of coin flips, the attacker
matches a head in the first stream at time $t$ to the first unmatched head in the
second stream in the time interval $[t, t+\Delta]$. All matched heads become real
(stepping-stone) packets, and all the remaining heads become chaff. An example
of the operation of the algorithm is shown in Fig. 4.
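The matching rule just described can be sketched as a two-pointer pass over the head times of both sequences. This is an illustrative reading of Bounded-Greedy-Match, not the authors' code. The function name, the 0/1 list encoding of coin flips, and the example sequences are hypothetical.

```python
from typing import List, Sequence, Tuple


def bounded_greedy_match(
    flips1: Sequence[int], flips2: Sequence[int], max_delay: int
) -> Tuple[List[Tuple[int, int]], List[int], List[int]]:
    """Match each head of stream 1 at time t to the first unmatched head of
    stream 2 in [t, t + max_delay]. Unmatched heads on either stream are chaff.

    Returns (matches, chaff_times_stream1, chaff_times_stream2).
    """
    heads1 = [t for t, coin in enumerate(flips1) if coin]
    heads2 = [t for t, coin in enumerate(flips2) if coin]
    matches: List[Tuple[int, int]] = []
    chaff1: List[int] = []
    chaff2: List[int] = []
    j = 0                                    # index of the earliest unmatched head in stream 2
    for t in heads1:
        # Heads of stream 2 earlier than t can no longer be matched by any later head.
        while j < len(heads2) and heads2[j] < t:
            chaff2.append(heads2[j])
            j += 1
        if j < len(heads2) and heads2[j] <= t + max_delay:
            matches.append((t, heads2[j]))   # real (stepping-stone) packet
            j += 1
        else:
            chaff1.append(t)                 # no head available within the delay bound
    chaff2.extend(heads2[j:])                # leftover heads on stream 2
    return matches, chaff1, chaff2


if __name__ == "__main__":
    s1 = [1, 0, 1, 1, 0, 0, 0, 1]
    s2 = [0, 1, 0, 0, 0, 1, 1, 0]
    print(bounded_greedy_match(s1, s2, max_delay=2))
```

Because each sequence has at most one head per time step, the head times are strictly increasing, which is what lets a single forward pointer find the first unmatched head.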

The following theorem shows that Bounded-Greedy-Match will allow the 
attacker to produce the minimum amount of chaff needed, when the attacker 
simulates two binomial processes that were generated independently. 



† We could, equivalently, assume that the attacker flips a coin with $\lambda/k$ bias $k$ times in
a time step. As $k \to \infty$, the binomial approaches a Poisson process of rate $\lambda$.
[Figure legend: Normal Packets; Chaff]

Fig. 4. An illustration of the matching produced by the algorithm
Bounded-Greedy-Match on two given sequences, with $\Delta = 2$



Theorem 8. Given any pair of sequences of coin flips generated by two indepen- 
dent binomial processes, Bounded-Greedy-Match minimizes the chaff needed 
for a pair of stepping-stone streams to mimic the given pair of sequences. 

Proof. Suppose not, i.e., suppose there exists a sequence pair of coin flips $\sigma$
for which Bounded-Greedy-Match is not optimal. Let $S$ be the strategy
produced by Bounded-Greedy-Match for $\sigma$. Let $S'$ be a better matching
strategy, so that Chaff($S$) > Chaff($S'$). Then there exists a head $h$ in $\sigma$ such that
$h$ is matched with a head $h'$ through $S'$, but not through $S$.

Assume, wlog, that $h$ is on the first stream at time $t$, and $h'$ on the second
stream. For $S'$ to be a valid match, $h'$ should be in $[t, t+\Delta]$, and $h'$ must be
unmatched under $S'$ to any other head. Let us suppose that $h'$ is matched to
another (earlier than $t$) head on the first stream under $S$ (otherwise
Bounded-Greedy-Match would have generated a match between $h$ and $h'$ in $S$).

We trace the chain of matching heads in the sequence backwards (starting
from $h$) in this way: we take the currently matched head in one strategy, and
look for the head that matches it in the other strategy. When this chain of
matchings stops, we must have an unmatched head, and one of the following two
cases (the manner in which we trace the chain of matching heads, along with
the assumption that the unmatched head $h$ is on the first stream, implies that
we find only matched heads on the second stream of $S$, and the first stream of
$S'$):

- Case 1: The unmatched head is in stream 1 of $S'$. In this case, an unmatched
head in $S$ corresponds to an unmatched head in $S'$, and therefore, this
particular case is not our counterexample, since each unmatched head under
$S$ will correspond to an unmatched head under $S'$.

- Case 2: The unmatched head is in stream 2 of $S$. In this case, we have to
have reached this head (call it $g_0$) from its matching head $g_1$ in $S'$; we have
to reach $g_1$ from matched head $g_2$ in $S$. Since we are tracing backwards in
time, the time of $g_2$ is greater than the time of $g_0$. However, since $g_0$ can be
matched to $g_1$, we have a contradiction, since we are not matching the head
$g_1$ to the earliest available head $g_0$, as per Bounded-Greedy-Match.



The analysis when h is on the second stream of S is similar. 
Thus, with the algorithm Bounded-Greedy-Match, every unmatched head
in $S$ must have a corresponding unmatched head in $S'$; therefore, Chaff($S$) $\le$
Chaff($S'$), creating a contradiction. □






Fig. 5. The proof of Theorem 8. All the figures give an illustration of how the heads
are traced back; (a) and (b) show case 1 of the proof, and (c) and (d) show case 2 of
the proof. By assumption, $h$ is unmatched in $S$ and matched in $S'$. $h$ is matched to $h'$
in the strategy $S'$; in $S$, $h'$ is matched to $h_2$; then, we look at $h_2$'s match in $S'$, call
it $h_3$; use $h_3$ to find $h_4$ in $S$, $h_4$ to find $h_5$ in $S'$, and so on. We continue tracing the
matches of heads backwards in this manner until we stop, reaching either case 1 or
case 2. In case 1, $g_1$ is unmatched in strategy $S'$, and in case 2, $g_0$ is unmatched in $S$, but
$g_1$ is not matched greedily.



Now we examine upper bounds on the chaff that will need to be sent by 
the attacker, in terms of the total packets sent. We give an upper bound on the 
amount of chaff that the attacker must send in Bounded-Greedy-Match. We 
note that our analysis shows how the attacker could do this if he mimics two 
independent Poisson processes, but it may not be necessary for him to do this 
in order to evade detection. 

Theorem 9. If the attacker ensures that his stepping-stone streams mimic two
truly independent Poisson processes, then, under Bounded-Greedy-Match,
the attacker will not need to send more than a
$\frac{1}{\sqrt{2\lambda\Delta - 2\sqrt{2\lambda(1-2\lambda)\Delta}}} + 0.05$ fraction
of packets as chaff in expectation, when the Poisson rates of the streams are
equal with rate $\lambda$.

Proof. We divide the total time (coin flips) into intervals that are $\Delta$ long, and
examine the expected difference in one of these intervals. Notice that for the
packets that are within a specific $\Delta$ interval, matches are not dependent on the
times when they were generated (i.e., any pair of packets in this interval is no
further than $\Delta$ apart in time, and therefore could be made a valid match). Many
more packets than this can be matched, across the interval boundaries, but this
gives us an easy upper bound.

Consider the packets in the union of the two streams in this interval. Each
packet in this union can also be considered as though it were generated from
a (different) unbiased coin, with heads as stream 1 and tails as stream 2; once
again, we have a uniform random walk. Since every head can be matched to any
available tail, the amount of chaff is the expected (absolute) difference in the
number of heads and tails. Call this difference $Z$, and the packets on the union
of the streams $X$. $X$ is then a binomial with parameters $2\Delta$ and $\lambda$. Therefore,
$E[X] = 2\lambda\Delta$. The expected fraction of chaff is then bounded as follows:

$$E\left[\frac{Z}{X}\right] = \sum_x \frac{1}{x}\, E[Z \mid X = x]\, P(X = x) \le \sum_x \frac{1}{\sqrt{x}}\, P(X = x) \le 0.05 + \frac{1}{\sqrt{2\lambda\Delta - 2\sigma}},$$

where $\sigma = \sqrt{2\lambda(1-2\lambda)\Delta}$.

Since every interval of size $\Delta$ is identical, the attacker needs to send no more
than a $\frac{1}{\sqrt{2\lambda\Delta - 2\sqrt{2\lambda(1-2\lambda)\Delta}}} + 0.05$ fraction as chaff in expectation. □
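The quantity at the heart of this proof can be evaluated exactly for small parameters. The sketch below follows the proof's setup: $X$, the number of packets on the union in one interval, is binomial with parameters $2\Delta$ and $\lambda$, and each packet is assigned to a stream by a fair coin. It reports the ratio $E[Z]/E[X]$, which is an assumption made here as a simple proxy for the chaff fraction and is not the closed-form bound of Theorem 9. The function names and parameter values are hypothetical.

```python
import math


def expected_abs_walk(x: int) -> float:
    """E|heads - tails| over x fair coin flips, computed exactly."""
    total = sum(math.comb(x, k) * abs(2 * k - x) for k in range(x + 1))
    return total / 2 ** x


def interval_chaff_ratio(lam: float, delta: int) -> float:
    """E[Z] / E[X] for one interval, with X ~ Binomial(2 * delta, lam)."""
    trials = 2 * delta
    expected_z = sum(
        math.comb(trials, x) * lam ** x * (1 - lam) ** (trials - x) * expected_abs_walk(x)
        for x in range(trials + 1)
    )
    return expected_z / (trials * lam)       # E[X] = 2 * lam * delta


if __name__ == "__main__":
    for delta in (20, 50, 200):
        print(delta, interval_chaff_ratio(0.25, delta))
```

The ratio shrinks as $\Delta$ grows, in line with the theorem's message that a larger delay bound lets the attacker mimic independent streams with proportionally less chaff.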



6 Conclusion 

In this paper, we have proposed and analyzed algorithms for stepping-stone de- 
tection using techniques from Computational Learning Theory and the analysis 
of random walks. Our results are the first to achieve provable (polynomial) up- 
per bounds on the number of packets needed to confidently detect and identify 
encrypted stepping-stone streams with proven guarantees on the probability of 
falsely accusing non-attacking pairs. Moreover, our methods and analysis rely 
on very mild assumptions, especially in comparison with previous work. We also 
examine the consequences when the attacker inserts chaff into the stepping-stone 
traffic, and give bounds on the amount of chaff that an attacker would have to 
send to evade detection. Our results are based on a new approach which can 
detect correlation of streams at a fine-grained level. Our approach may apply to 
more generalized traffic analysis domains, such as anonymous communication. 
Abstract. We present a formal framework for the analysis of intrusion
detection systems (IDS) that employ declarative rules for attack recog- 
nition, e.g. specification-based intrusion detection. Our approach allows 
reasoning about the effectiveness of an IDS. A formal framework is built 
with the theorem prover ACL2 to analyze and improve detection rules 
of IDSs. SHIM (System Health and Intrusion Monitoring) is used as an 
exemplary specification-based IDS to validate our approach. We have 
formalized all specifications of a host-based IDS in SHIM which together 
with a trusted file policy enabled us to reason about the soundness and 
completeness of the specifications by proving that the specifications sat- 
isfy the policy under various assumptions. These assumptions are prop- 
erties of the system that are not checked by the IDS. Analysis of these 
assumptions shows the beneficial role of SHIM in improving the security 
of the system. The formal framework and analysis methodology will pro- 
vide a scientific basis for one to argue that an IDS can detect known and 
unknown attacks by arguing that the IDS detects all attacks that would 
violate a policy. 

Keywords: Intrusion detection, verification, formal method, security 
policy 



1 Introduction 

Intrusion detection is an effective technology to supplement traditional security 
mechanisms, such as access control, to improve the security of computer systems. 
To date, over 100 commercial and research products have been developed and 
deployed on operational computer systems and networks. While IDS can improve 
the security of a system, it is difficult to evaluate and predict the effectiveness 
of an IDS with respect to the primary objective users have for the deployment 
of such a system: the ability to detect large classes of attacks (including variants 
of known attacks and unknown attacks) with a low false alarm rate. In addition,



E. Jonsson et al. (Eds.): RAID 2004, LNCS 3224, pp. 278-295, 2004. 
© Springer-Verlag Berlin Heidelberg 2004




Formal Reasoning About Intrusion Detection Systems 279 



it is difficult to assess, in a scientific manner, the security posture of a system 
with an IDS deployed. So far, experimental evaluation and testing have been the 
only approaches that have been attempted. There is a critical need to establish 
a scientific foundation for evaluating and analyzing the effectiveness of IDSs. 

This paper presents an approach to formal analysis of IDSs. Our approach 
is primarily applicable to IDSs that employ declarative rules for intrusion detec- 
tion, including signature-based detection and specification-based detection [16] 
[18] [3] [8] [4] [5] [22]. The former matches the current system or network activ- 
ities against a set of predefined attack signatures that represent known attacks 
and potential intrusive activities. The latter recognizes attacks as activities of 
critical objects that violate their specifications. Testing is currently being used to 
evaluate the effectiveness of the rules. Nevertheless, testing is usually performed 
according to the tester’s understanding of known attacks. It is difficult to verify 
the effectiveness of an IDS in detecting unknown attacks. 

Our approach is inspired by the significant body of formal methods research 
in designing and building trusted computer systems. Briefly, the process of de- 
signing and building a trusted system involves the development of a security 
model, which consists of a specification of a security policy (the security re- 
quirements or what is meant by security) and an abstract behavioral model of 
the system. Usually, the security policy can be stated as a mapping from system
states to authorized (secure) and unauthorized (insecure) states [14] or as a
property (often stated as an invariant) of the system (e.g., noninterference). The 
model is an abstraction of the actual system that provides a high level descrip- 
tion of the major entities of the system and operations on those entities. There 
may be layers of abstractions within the model, each a refinement of the higher 
level abstraction. Given the security policy and model, one should be able to 
prove that the model satisfies the security policy, assuming some restrictions on 
the state transition functions (e.g., the classical Bell and LaPadula model). 

Our framework consists of an abstract behavioral model, specifications of 
high-level security properties, and specifications of intrusion-detection rules. The 
abstract behavioral model captures the real behavior of the targeted system. In 
addition to common abstractions such as access control lists, processes, and 
files, the abstract behavioral model will capture the auditing capabilities of the 
targeted system (i.e., given an operation, it will be decided whether or not an
audit event will be generated and what information about the operation will be 
visible). The specifications of intrusion-detection rules describe formally when 
the rules will produce IDS alarms given the sequence of audit events generated. 
Intrusion detection rules can be viewed as constraints on the audit trace of the 
system (e.g., the sequence of observable state changes). 

We employ the formal framework to analyze the properties of SHIM, a 
specification-based IDS that focuses on the behaviors of privileged programs. 
In SHIM, specifications are developed to constrain the behaviors of privileged 
programs to the least privilege that is necessary for the functionality of the 
program. 

ACL2 [15] is employed to describe an abstract system model that can be 
used as the basis for different IDSs. A hierarchical model is built to generalize 
the verification of specifications. As an example, we formalize specifications of
SHIM and a security policy (e.g., a trusted file access policy), and we prove that
these specifications can satisfy the policy with various assumptions. Again, the
assumptions represent activities that the SHIM IDS does not monitor, although
it could if the IDS designer believes an attacker could cause the assumptions to
be violated.

The rest of the paper is structured as follows: Section 2 introduces and ana- 
lyzes intrusion detection rules, primarily used in a specification-based IDS such 
as SHIM. Section 3 describes a hierarchical framework of verification. Section 
4 shows an example of our verification approach. We formalize specifications 
of SHIM and prove that these specifications together with assumptions satisfy 
trusted file access policies. In Section 5 we discuss our results and the limitations 
of the verification method we developed. We conclude and provide recommen- 
dations for future work in Section 6. 



2 Analysis of Intrusion-Detection Rules 

Development of correct intrusion-detection rules is a very difficult and error- 
prone task: it involves extensive knowledge engineering on attacks and most 
components of the system; it requires a deep and correct understanding of most 
of the components in a system and how they work together; it requires the rule 
developers to be cautious and careful to avoid mistakes and gaps in coverage. 
Often, crafting of intrusion-detection rules is performed by human security ex- 
perts based on their knowledge and insights on attacks and security aspects of a 
system. Therefore, it is very difficult to assess whether a given set of intrusion- 
detection rules is correct (they detect the attacks). Furthermore, the complexity 
and subtlety of systems and attacks make it a challenging task to judge whether 
changes to the rules actually improve or degrade their efficacy with respect to 
their ability to detect new attacks. 

We discuss the subtleties involved in writing valid behavior specifications for a 
program. Traditionally, in specification-based IDSs, a valid behavior specification 
for a program declares what operations and system calls are allowed for the 
program. Whether an operation is allowed or not depends on the attributes of 
the process and the object reference, and attributes of the system calls. In SHIM, 
a specification for a program is a list of rules describing all the operations valid 
for the program. For example, the following rule in the line printer daemon (lpd)
specification allows the program to open any file in the /var/spool/hp directory
to read and write.

(open, $flag == O_RDWR && InDir($F.path, "/var/spool/hp"))

The expression formally describes a set of valid operations: any open() system
call with the flag argument equal to O_RDWR (open the file for read and write)
with an absolute path name subordinate to the /var/spool/hp directory.
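To make the semantics of such a rule concrete, the check it expresses can be sketched in Python. This is only an illustration of the predicate, not SHIM's rule engine. The function names `in_dir` and `lpd_open_rule` are hypothetical, and normalizing the path before the directory test is an added assumption.

```python
import os
import posixpath


def in_dir(path: str, directory: str) -> bool:
    """True if the absolute `path` lies strictly below `directory`."""
    if not posixpath.isabs(path):
        return False
    normalized = posixpath.normpath(path)          # collapses "." and ".." components
    base = posixpath.normpath(directory)
    return normalized.startswith(base + "/")


def lpd_open_rule(syscall: str, flag: int, path: str) -> bool:
    """Valid operation: open() for read and write on a file under /var/spool/hp."""
    return (
        syscall == "open"
        and flag == os.O_RDWR
        and in_dir(path, "/var/spool/hp")
    )


if __name__ == "__main__":
    print(lpd_open_rule("open", os.O_RDWR, "/var/spool/hp/job1"))
    print(lpd_open_rule("open", os.O_RDWR, "/etc/passwd"))
```

An operation that the rule rejects is one the monitor would report, since the specification lists only valid behavior.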

One way to develop a specification for a program is to first identify what 
operations and accesses the program needs to support its functionality. Based on 
an examination of the code or its behaviors, one writes rules in the specification
to cover the valid operations of the program. The “draft” specification will be 
tested against the actual execution of the program. Often, the draft specification, 
when used to monitor the program execution, will produce false positives (i.e., 
valid operations performed by the program reported as erroneous because they 
are not included in the specification). Then, one augments the specification to 
include rules to cover these operations. In general, one needs to be very careful 
in writing the specification for a program to avoid errors. 

For example, given the above rule, if /var/spool/hp somehow is writable by
attackers, they can create a link from /var/spool/hp/file to the /etc/passwd file.
A specification-based IDS with this rule in the specification of lpd will permit
this operation and the attack will go undetected. Therefore, we augment the rule
to check the number of links to the file and to generate a warning if the number
of links to the file is greater than one, thus preventing this attack from using hard
links. This also works for soft links in our system because the audit record for an
open() operation will provide the absolute pathname of the file being opened, if
the path is a symbolic link. Based on our experience, writing specifications for
a program is subtle and tricky, thus demanding an approach to rule validation.
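The hard-link check described above amounts to inspecting the link count of the file being opened. A minimal sketch, assuming a POSIX-style file system that reports link counts; the function name is hypothetical and this is not how SHIM itself obtains the value (SHIM works from audit records):

```python
import os


def has_extra_hard_links(path: str) -> bool:
    """Return True if the file behind `path` has more than one hard link."""
    return os.stat(path).st_nlink > 1
```

A monitor applying the augmented rule would raise a warning whenever this returns True for a file opened under the spool directory.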

Little research has been done on analyzing intrusion-detection rules. Different 
approaches have been taken to specify and analyze the intrusion signatures and 
detection rules [12] [19] [17], primarily for signature-based IDSs. A declarative 
language, MuSigs, is proposed in [12] to describe the known attacks. Temporal 
logic formulas with variables are used to express specifications of attack scenarios 
[19]. Pouzol and Ducasse formally specified attack signatures and proved the 
soundness and completeness of their detection rules. In addition, data mining
techniques and other AI techniques such as neural networks are used to refine
and improve intrusion signatures [6] [13] [21].

Our approach is different from these approaches in various ways. First of 
all, we developed a framework to evaluate detection rules of different IDSs. We 
formalized security-relevant entities of a UNIX-like system as well as access
logs. Detection rules including intrusion signatures and specifications can be 
formalized and reasoned about in the framework. 

Second, we proposed a method to verify security properties of IDSs together
with assumptions, with respect to security policies. Security policies are always
satisfied with sufficiently strong assumptions. So the key is to identify
assumptions that are strong enough but not too strong. An attack can violate a security
policy by breaking its assumptions. So it is possible to verify the improvement
of security by proving the weakening of assumptions. For example, assume a
policy $P$ is satisfied with assumption $A$ and, with the deployment of the
mechanism $m$, $P$ is satisfied with assumption $B$, where $A$ implies $B$. Then we can
say $m$ improves the security because attacks violating assumption $B$ will also
violate $A$, but attacks that violate assumption $A$ may not violate $B$.

As our preliminary results, we have verified a significant property of specifi- 
cation-based IDSs: the capability to detect unknown attacks. In our verification, 
the specifications of SHIM satisfy a passwd file access policy with assumptions. 
This means any attacks, including known attacks and unknown attacks, that
violate the policy can be detected by SHIM. 

3 Framework 

We present a framework for analyzing detection rules in IDSs. Our goal is to 
answer the question of whether a given set of intrusion detection rules can satisfy 
the security requirements of the system. Security policies and properties of attacks
are used to describe the security requirements of the system. The satisfaction of 
the security requirements determines whether violations of security policies or 
instances of attacks can be detected by the detection rules. 

3.1 Hierarchical Framework of Verification 

Figure 1 depicts the verification model, which consists of an abstract system 
model, an auditing model, detection rules, assumptions, and security require- 
ments. The basis of the model is the abstract system model (S) in which security- 
critical entities of the system are formalized. The auditing model (L) is necessary 
for the model because almost all IDSs are based on the analysis of the audit 
trails from operating systems, applications and network components. Detection 
rules (R) vary depending on the IDS. In SHIM, detection rules are specifications
of normal behaviors of privileged programs. Security requirements (SR) define
properties that should be satisfied to guarantee the security of the system.
Assumptions (H) are necessary for the verification. Security properties that we are
not sure of and, more importantly, properties that cannot be efficiently monitored
will be declared as assumptions (e.g., the kernel of the system is not subject to
attack). Note that all of our assumptions could be checked by monitoring, but at
a substantial performance penalty to the IDS.





[Figure: verification hierarchy with components security requirements (SR),
detection rules (R), assumptions (H), auditing model (L), and abstract system
model (S). Verification: $S \cup L \cup H \cup R \rightarrow SR$.]

Fig. 1. Verification Hierarchy



3.2 Formalization of the Model 

In this section, we describe how to construct the components of the framework. 
We start with the abstract system model. 
The abstract system model plays an important role in the framework. It
provides a general basis to formalize security requirements, detection rules of IDSs
and assumptions, and makes it possible to verify security properties of detection
rules. To develop a simplified abstract model, we only formalize security-critical
parts of UNIX-like systems. The model can be defined as a tuple (F, U, E, P,
S) where F describes a file system, U shows user and group mechanisms, E
corresponds to environment variables, P describes a list of processes, and S
describes system call interfaces of the kernel. Our preliminary experiment focuses
on the access control mechanism, so we define access permissions of file objects
and privileges of subjects.

Because of the importance of the auditing component in IDSs, we formalize
it separately from the abstract system model. We model the auditing component
at the system call level. Assume $\Sigma$ is the set of all system calls, and let $B \subseteq \Sigma^*$ be all
sequences of operations of a program A. A trace $b \in B$ represents a sequence of
operations of A. For each operation of a program, we use a tuple $(p, f, c, n)$ to
indicate that a process $p$ invokes a system call $c$ on object $f$ and assigns a new
property $n$ to $f$ (e.g., a new owner for a file).

Detection rules vary according to different IDSs. In a specification-based IDS,
detection rules are specifications which are used to describe normal behaviors
of systems. Conversely, in a signature-based IDS, detection rules are signatures
that identify different attacks. In this paper, we focus on the specification-based
approach. Suppose the set of all possible behaviors of a program is defined as $B$;
a specification spec() can identify a set of valid behaviors $VB$ where $VB \subseteq B$
and, for any trace $b$, $b \in VB$ iff spec(b) = true.

Security requirements are used to describe properties necessary to satisfy
the security of the system. There are basically two ways to present the security
requirements: one is to define security policies, the other is to describe attack
scenarios. Security policies can map the behavior of a system into two states:
authorized or unauthorized. In this way, a security policy security-policy()
separates the behavior of a system into an authorized behavior set $AB$ and an
unauthorized behavior set $UB$ where $AB \subseteq B$, $UB \subseteq B$. For any trace $b$, $b \in AB$
iff security-policy(b) = “authorized”. Attacks are behaviors that violate the
security policy. We can use functions to define characterizations of attacks. An
attack function attack() can define a set of dangerous behaviors $DB$ where
$DB \subseteq B$ and, for any trace $b$, $b \in DB$ iff attack(b) = true.

In the verification, we try to answer two questions: Can some security policies
be satisfied by IDSs? And can some attacks remain undetected by IDSs? The first
question can be formalized as follows. Given specification $s$ and security policy
$p$, is the valid behavior set $VB$ that is defined by $s$ a subset of the “authorized”
behavior set $AB$ defined by $p$? We describe this relation as $VB \subseteq AB$ or, for any
trace $b$, spec(b) = true implies security-policy(b) = “authorized”. In some cases,
assumptions are introduced in proving whether a security policy is satisfied by
a specification. The verification can be described as: for any trace

$b \in B$, (assumption(b) = true) $\wedge$ (spec(b) = true) $\vdash$ (sr(b) = “authorized”).
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The second question can be formalized as follows. Given an attack ab or a 
set of attacks DB, is ab a member of the valid behavior set VB or does DB share 
elements with VB. It can be described as ab ^ VB or DB n VB = (j>. 
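The two questions can be illustrated on a finite toy universe of traces. The following is a minimal Python sketch, not the ACL2 formalization: the traces and the bodies of `spec`, `security_policy` and `attack` are assumptions chosen only to make the set relations concrete.

```python
# Toy universe B of behaviors. Each trace is a tuple of (operation, path) pairs.
# All traces and predicates below are illustrative assumptions.
B = [
    (("open_rd", "/etc/motd"),),
    (("open_wr", "/var/log/xferlog"),),
    (("open_wr", "/etc/passwd"),),
    (("exec", "/bin/bash"),),
]

def spec(trace):
    """Specification: reads are valid; writes are valid only to the transfer log."""
    return all(
        op == "open_rd" or (op == "open_wr" and path == "/var/log/xferlog")
        for op, path in trace
    )

def security_policy(trace):
    """Policy: a trace that writes /etc/passwd is unauthorized."""
    writes_passwd = any(op == "open_wr" and path == "/etc/passwd" for op, path in trace)
    return "unauthorized" if writes_passwd else "authorized"

def attack(trace):
    """Attack characterization: the trace invokes a shell."""
    return any(op == "exec" and path == "/bin/bash" for op, path in trace)

VB = {b for b in B if spec(b)}                              # valid behaviors
AB = {b for b in B if security_policy(b) == "authorized"}   # authorized behaviors
DB = {b for b in B if attack(b)}                            # dangerous behaviors

policy_satisfied = VB <= AB            # question 1: VB is a subset of AB
attacks_detected = VB.isdisjoint(DB)   # question 2: DB and VB share no element
```

On a finite universe both relations can be checked by enumeration. The theorem prover is needed because the real set of behaviors is unbounded.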




[Figure 2 panels: (1) $VB \subseteq AB$: the security policy is satisfied. (2) $ab \notin VB$ and $DB \cap VB = \emptyset$: the attacks are detected.]

Fig. 2. Relationship among security policy, specifications and attacks



3.3 Mechanization of the Model 

ACL2 is used in the mechanization of the framework. Structures and functions in
ACL2 are used to formalize declarative components of the framework, including
an abstract system model, audit data, detection rules of IDSs, assumptions, and
security requirements. To perform the verifications, we define the appropriate
theorems in ACL2 and prove them using mathematical induction and the other
proof mechanisms of ACL2.



Introduction of ACL2. ACL2 is a significant extension of Nqthm [1], intended
for large scale verification projects. The ACL2 system consists of a programming
language based on Common Lisp, a logic of total recursive functions, and a
mechanical theorem prover [15]. The syntax of ACL2 is that of Common Lisp. In
addition, ACL2 supports the macro facility of Common Lisp. The following data
types are axiomatized: rational and complex numbers, character objects, strings,
symbols and lists. Common Lisp functions on these data types are axiomatized or
defined as functions or macros in ACL2 [9]. Several functions that are used in our
verification are listed in Table 1. The ACL2 logic is a first order logic of recursive
functions providing mathematical induction on the ordinals up to $\epsilon_0$ and two
extension principles: one for recursive definition and one for the constrained
introduction of new function symbols. Each preserves the consistency of the
extended logic.

The ACL2 theorem prover has powerful and effective heuristics controlling
the application of the following proof techniques: preprocessing, including tautology
checking using ordered binary decision diagrams (OBDDs) under user direction;
simplification by primitive type checking and rewriting; destructor elimination;
cross-fertilization using equivalence reasoning; generalization; elimination
of irrelevant hypotheses; and mathematical induction.

Table 1. Important functions of ACL2

Function         Description
nil              The empty list, or false in Boolean contexts
t                True
(if x y z)       If x is not equal to nil, return y; otherwise return z
(equal x y)      If x and y have the same value, return t; otherwise return nil
(and x y)        “And” operation in Boolean contexts
(car l)          First element of list l
(cdr l)          All but the first element of list l
(cons x l)       Add x onto the front of list l
(implies x y)    If x is nil or y is not nil, return t; otherwise return nil



Abstract System Model. The abstract system model is formalized as a structure
sys in ACL2. Security-critical components are defined as fields of the structure.
For each field, assertions are defined to check the integrity of the field's
values. A predicate sys-p is defined to recognize values that have the required
structural form and whose fields satisfy the assertions. The predicate weak-sys-p
is defined to recognize values that have the required structural form, but does
not require the field assertions to be satisfied. Functions are defined to get, put
or check values from specific fields. In our verification, we can use instances of
the structure to tell whether a statement is true in a system with specific settings.
On the other hand, the system model can appear in a theorem without
specific values to indicate a general condition under which the statement holds.

(defstructure sys
  (proglist (:assert (and (not (endp proglist)) (proglistp proglist))))
  ;list of programs, e.g. privileged programs
  (calllist (:assert (and (not (endp calllist)) (calllistp calllist))))
  ;list of system calls, e.g. open, read, write etc.
  (filelist (:assert (and (not (endp filelist)) (filelistp filelist))))
  ;list of files, e.g. /etc/passwd file
  (userlist (:assert (and (not (endp userlist)) (userlistp userlist))))
  ;list of system users
  (envlist (:assert (and (not (endp envlist)) (envlistp envlist))))
  ;list of environment variables, e.g. home directories in a UNIX system
  )
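The distinction between sys-p and weak-sys-p can be made concrete with a small analogue. This is a minimal Python sketch, assuming that each field assertion amounts to "a non-empty list of strings"; the real proglistp, calllistp and similar recognizers are not given in the text, so `_nonempty_str_list` is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import List

FIELDS = ("proglist", "calllist", "filelist", "userlist", "envlist")

@dataclass
class Sys:
    proglist: List[str]   # programs, e.g. privileged programs
    calllist: List[str]   # system calls, e.g. open, read, write
    filelist: List[str]   # files, e.g. /etc/passwd
    userlist: List[str]   # system users
    envlist: List[str]    # environment variables

def _nonempty_str_list(value):
    """Hypothetical field assertion: a non-empty list whose items are strings."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(isinstance(v, str) for v in value)
    )

def weak_sys_p(x):
    """Structural form only: a Sys value with all five fields present."""
    return isinstance(x, Sys) and all(hasattr(x, f) for f in FIELDS)

def sys_p(x):
    """Structural form plus every field assertion."""
    return weak_sys_p(x) and all(_nonempty_str_list(getattr(x, f)) for f in FIELDS)

good = Sys(["ftpd"], ["open", "read"], ["/etc/passwd"], ["root"], ["HOME"])
bad = Sys([], ["open"], ["/etc/passwd"], ["root"], ["HOME"])  # empty proglist
```

Here `bad` has the right shape, so `weak_sys_p(bad)` holds, but it fails `sys_p` because one field assertion is violated.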



Audit Trail. The auditing capability of a system is formalized as a list of
audit records, and an audit record is formalized as a structure logrec in ACL2.
We follow the Sun Solaris BSM audit subsystem and simplify the audit record
structure to four fields: the process, the file object, the system call, and the
new properties of the file object.

(defstructure logrec
  (pobj (:assert (and (consp pobj) (proc-obj-p pobj))))
  ;object of the process
  (fobj (:assert (and (consp fobj) (file-obj-p fobj))))
  ;object of the target file
  (callobj (:assert (and (consp callobj) (syscall-obj-p callobj))))
  ;object of the system call
  (newattrobj (:assert (newattr-obj-p newattrobj)))
  ;new properties of the target file
  )



Security Requirements. In our verification, different classes of attacks and 
security policies are formalized to analyze detection rules of IDSs. 

There are two ways to verify whether an attack can be detected by a specific 
IDS. The first method is to formalize possible audit trails, which include the 
attack scenarios, and then analyze the audit data according to the specification 
of the program for the violation. Such verification can be used to prove the ca- 
pabilities of the specifications to detect known attacks. A more general way is 
to describe the security property that will be violated by the attacks instead of 
particular audit trails. Then we develop a proof based on the property that the 
formalized specifications will always result in the system being monitored for 
that property. For example, in an ftp-write attack, an attacker takes advantage 
of a normal anonymous ftp misconfiguration. If the ftp directory and its subdi- 
rectories are owned by the ftp account or in the same group as the ftp account, 
the attacker will be able to add files (such as the .rhosts file) and eventually gain 
local access to the system. 

Security policies are also formalized to allow reasoning about the security
properties of specifications. Trusted file access policies are security policies that
we developed to keep trusted files from unauthorized access. In UNIX systems,
a discretionary access control (DAC) mechanism defines whether a subject can
access an object, depending on the privilege of the subject and the access
permission of the object. Some files are intended to be accessed only by specific
users or through specific programs. For example, the passwd file of a UNIX system
should be editable by root using any program, or by an ordinary user using the
passwd program. Thus, file access policies are defined in our format as: (trusted file,
authorized user, program, access), where trusted file is the file to be protected,
authorized user is the user that can access the file with any program,
program is the program that other users can use to access the file, and
access is the set of operations covered by the policy.

As an example, the passwd file access policy is defined as: (/etc/passwd,
root, passwd, (open-wr, create, chmod, chown, rename)). This policy is used in
the verifications of the next section. The policy is formalized as a function in
ACL2.
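A trusted file access policy of this form can be read as a predicate over single accesses. The following is a minimal Python sketch under that reading; the names `FileAccessPolicy` and `authorized` are hypothetical, and treating accesses outside the listed operations as unrestricted is an assumption rather than something stated in the text.

```python
from typing import FrozenSet, NamedTuple

class FileAccessPolicy(NamedTuple):
    trusted_file: str
    authorized_user: str     # may use any program
    program: str             # may be used by any user
    access: FrozenSet[str]   # operations covered by the policy

PASSWD_POLICY = FileAccessPolicy(
    trusted_file="/etc/passwd",
    authorized_user="root",
    program="passwd",
    access=frozenset({"open-wr", "create", "chmod", "chown", "rename"}),
)

def authorized(policy, user, program, path, operation):
    """Return True if the access is permitted under the policy."""
    if path != policy.trusted_file or operation not in policy.access:
        # Assumption: the policy only constrains the listed operations
        # on the trusted file; everything else is left to DAC.
        return True
    return user == policy.authorized_user or program == policy.program
```

For instance, `authorized(PASSWD_POLICY, "alice", "ftpd", "/etc/passwd", "open-wr")` is rejected, while the same write by root, or by alice through the passwd program, is accepted.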

Assumptions. Our verification methodology rests on assumptions. A system
specification will have assumptions on how the system and programs behave.
The specifications cannot be declared complete before all of their assumptions
are identified. In some cases, once the assumptions are declared
as required by the verification approach, an IDS does not have to monitor the
properties asserted in these assumptions. There are two different kinds of
assumptions in our verification: general assumptions of the system and specific
assumptions of verifications.

System assumptions are very important although they are not formalized
in our verification. Some general assumptions of the system model are listed as
follows:

- The system kernel is not vulnerable to attack.
  The security of the system kernel is beyond the scope of this paper. We simply
  assume that the system kernel is not vulnerable to attack.

- The DAC mechanism of the system is correctly implemented.
  Access control is a concern of our verification, so correct implementation of the
  DAC mechanism is an assumption for the security of the system. If the access
  control mechanism is not well implemented and a user can access objects for
  which he is not authorized, it is impossible to protect these objects by only
  constraining the behavior of privileged programs.

- The log data is complete.
  As a hypothesis of the IDS, audit logs should record the trace of attacks so that
  analysis of the audit logs may detect such attacks. Therefore, log data should
  include all important operations in their correct sequence. If an attacker can
  successfully eliminate his traces before an IDS analyzes them, it is impossible
  for the IDS to detect this activity.

The specific assumption of verification will be discussed in section 4, in the 
context of the verifications of specific IDS and security policies. 

4 Specification and Verification of SHIM 

We formalized the specifications of a specification-based IDS, SHIM, and ana- 
lyzed them according to different security policies and attacks. 

4.1 Introduction of SHIM 

SHIM is a specification-based IDS. Specification-based IDSs are based on the 
creation of specifications that describe desired functionality for security-critical 
entities [7] [8] [20] [10] [23]. The security specifications in SHIM mainly focus 
on the valid operations of a UNIX privileged program. Privileged programs are 
analyzed because of their significant impact on system security. The effective 
user of a privileged program has root privileges, and attacks against a privileged 
program often exploit the privilege to access security-critical objects that are 
not intended to be accessed by the victim program. For example, in an ftp buffer
overflow attack, an attacker can invoke a shell with root privilege and use it to
access any file of the system [2]; that the attack is a buffer overflow is incidental
to the specification.

During program operation, the system accesses associated with the operation 
of a program are recorded in audit logs and matched against the specifications 
by SHIM. Mismatches are reported and almost always indicate an attack. The- 
oretically, SHIM is capable of detecting unknown attacks or variants of known 
attacks, and a report is issued as soon as a specification violation occurs. If the 
program was compromised by an attack (e.g. buffer overflow) and attempted to 
invoke any system calls that violated the specifications, an alert would be raised.

4.2 Formalization of Specifications

In SHIM, a language, Parallel Environment Grammar (PE grammar), is devel- 
oped to define specifications that describe all valid operations of a program. The 
language permits the parameterization of the language syntax and environment 
variables that aid in parsing efficiency. PE grammar can be used to specify the 
valid execution traces of programs. The specification of the ftp daemon is listed
below to show how the language works:

SE: <prog>
<prog> -> <validop>*;
<validop>
  -> (OPEN_RD, WorldReadable($F.mode))
     ;the program can read a file that is world-readable
   | (OPEN_RD, CreatedByProc($P.pid, &$F))
     ;the process can read a file that is created by itself
   | (OPEN_RD, $F.ouid == $S.uid)
     ;the process can read a file whose owner is the current user
   | (OPEN_WR, CreatedByProc($P.pid, &$F))
   | (OPEN_WR, $F.path == "/var/log/wtmp")
     ;the process can write to a file at a specific path
   | (OPEN_WR, $F.path == "/var/log/xferlog")
   | (OPEN_RW, $F.path == "/var/run/ftp.pids-all")
   | (open, $F.path == "/dev/null")
   | (unlink, CreatedByProc($P.pid, &$F))
   | (CHMOD, CreatedByProc($P.pid, &$F))
   | (CHOWN, CreatedByProc($P.pid, &$F))
   | (fork || vfork)
   | (OPEN_RD, InDir($F.path, getHomeDir($S.uid)))
     ;the process can read a file situated in a specific directory
   | (OPEN_WR, InDir($F.path, getHomeDir($S.uid)))
   | (read, IsSocket($F.mode) && $K.lport == 21)
     ;the process can get information from a specific port
   | (write, IsSocket($F.mode) && $K.lport == 21)
   | (CREAT, InDir($F.path, getHomeDir($S.uid)))
   | (EXEC, $path == "/bin/tar" || $path == "/bin/compress" ||
            $path == "/bin/ls" || $path == "/bin/gzip") ;END;

In this specification, valid operations are defined with the term validop. Eighteen
valid operations are included in this specification, and each valid operation is a
function of system calls and environment variables. For example, the operation
(OPEN_RD, WorldReadable($F.mode)) means that the program can open a file in
read mode when the file is readable by all users. PE grammar is capable of
defining a multi-state specification, but in this specification only one state is
used, namely the state associated with the invocation of the program. This
specification shows a balance between expressiveness and detection efficiency.

In our verification, a function is defined to check whether the audit trail of an ftp
daemon process violates the specification. The function accepts an audit trail
as a parameter. If any operation in the audit trail violates the specification, the
function returns false; otherwise it returns true. Each valid operation is defined with
two functions: operation and property restriction. The operation function defines
the operation on an object, and the property restriction function defines the
condition under which the operation may be performed. For example, the valid
operation (OPEN_RD, WorldReadable($F.mode)) can be formalized as (and (operate
'openrd logrec) (WorldReadable (logrec-fobj logrec))). In this definition, the
function operate checks that the audit record carries the given operation, and the
function WorldReadable determines whether the permission of the file is world readable.
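The pairing of an operation with a property restriction can be sketched outside ACL2 as follows. This is a minimal Python illustration covering only two of the eighteen valid operations; the `LogRec` fields and the helper `path_is` are assumptions made for the example and do not mirror the actual logrec structure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogRec:
    call: str                     # system call, e.g. "open_rd"
    path: str                     # path of the target file
    world_readable: bool = False  # simplified view of the file mode

def operate(call, rec):
    """Operation function: does the record carry the given operation?"""
    return rec.call == call

def world_readable(rec):
    """Property restriction: the target file is readable by all users."""
    return rec.world_readable

def path_is(path):
    """Build a property restriction that pins the target path (hypothetical helper)."""
    return lambda rec: rec.path == path

# Each valid operation is an (operation, property restriction) pair.
VALID_OPS = [
    ("open_rd", world_readable),
    ("open_wr", path_is("/var/log/xferlog")),
]

def valid_op(rec):
    return any(operate(call, rec) and restriction(rec) for call, restriction in VALID_OPS)

def spec_ftpd(log):
    """True iff no record in the audit trail violates the specification."""
    return all(valid_op(rec) for rec in log)

ok_trail = [LogRec("open_rd", "/etc/motd", True), LogRec("open_wr", "/var/log/xferlog")]
bad_trail = ok_trail + [LogRec("open_wr", "/etc/passwd")]
```

The checker returns true for `ok_trail` and false for `bad_trail`, since the write to /etc/passwd matches no valid operation.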

4.3 Verifications 

Our verification focuses on the effectiveness of the specifications of SHIM in
satisfying security requirements, including attacks and security policies. We attempt to
address the issue of whether, with specific detection rules, SHIM can detect specific
attacks or detect attacks that cause specific security policies to be violated.

Detection of Attacks. Attacks are modeled as sequences of operations. We 
use two ways to describe attack scenarios: an audit trail that contains an attack 
or a characterization of attacks. According to the characteristic of specification- 
based IDS, SHIM cannot detect any attacks that do not change the behavior of 
victim programs. So here we introduce an assumption about attack: 

Assumption: an attack cannot cause any damage without changing the behav- 
ior of a victim program. 

Then we can claim that SHIM is capable of detecting attacks before or at 
least “as” they cause damage to the system. 

For known attacks, we can always simulate their audit trails. Consider
the buffer overflow attack against wuftpd 2.4.2-beta-18. The program can be
compromised by overflowing a buffer in strcat(). We simulated an audit trail
that invoked a shell after penetration of strcat(). Then we used the specification
of the ftp daemon to check this audit trail. A violation is reported, which indicates
that the attack can be detected by SHIM. Further analysis shows that the
violation is revealed by the audit record that indicates invocation of the shell.
The call to the library function strcat() does not reveal a violation. This result
proves that the specification of SHIM can detect this buffer overflow attack if
the attacker tries to invoke a shell after penetration.

For unknown attacks, we can consider a group of similar attacks that invoke
shells after they successfully compromise an ftp daemon program. We state a
theorem which shows that any audit trail with an operation invoking a shell will be
detected by the specification of the ftp daemon. The theorem is defined as:

(defthm attack-ftp
  (implies
    (member 'exec "/bin/bash" log sys)
    ;any operation invoking a shell
    (not (spec_ftpd sys log nil))
    ;violates the specification of the ftp program
  ))

This theorem demonstrates an important feature of SHIM: detection of unknown
attacks. Any unknown attack against the ftp daemon will be detected if an
operation that invokes a shell is observed. The proof of the theorem is straightforward
because /bin/bash is not a valid path for the exec system call in the specification.
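The content of the attack-ftp theorem can be illustrated by a bounded check. The following is a minimal Python sketch that enumerates every audit trail up to a small length over a toy alphabet of records and confirms that none containing an exec of /bin/bash passes the specification. It is a finite test, not a proof, and the reduced `valid_op` and the alphabet are assumptions made for the example.

```python
from itertools import product

# Paths the specification allows for the exec operation.
ALLOWED_EXEC = {"/bin/tar", "/bin/compress", "/bin/ls", "/bin/gzip"}

def valid_op(rec):
    """Reduced stand-in for the ftpd specification (assumption)."""
    call, path = rec
    if call == "exec":
        return path in ALLOWED_EXEC
    return call == "open_rd"

def spec_ftpd(log):
    return all(valid_op(rec) for rec in log)

SHELL = ("exec", "/bin/bash")
ALPHABET = [
    ("open_rd", "/etc/motd"),
    ("exec", "/bin/ls"),
    ("open_wr", "/etc/passwd"),
    SHELL,
]

def attack_ftp_holds(max_len=3):
    """Check: every trail that contains a shell invocation violates the spec."""
    for n in range(1, max_len + 1):
        for log in product(ALPHABET, repeat=n):
            if SHELL in log and spec_ftpd(log):
                return False   # counterexample found
    return True
```

The check succeeds for the same reason the ACL2 proof is short: a single record that matches no valid operation is enough to make the whole trail fail, regardless of what the attacker did before invoking the shell.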



Proving a Specification Satisfies a Security Policy. In this section, we 
carry out a verification that indicates the trusted file access policy is satisfied 
by specifications of SHIM with some assumptions. 

We use the passwd file access policy as an example in this verification. As 
we defined in section 3, the passwd file access policy defines how the passwd 
file should be accessed. According to the DAC mechanism of UNIX systems, 
any user without root privilege cannot modify the passwd file except through a 
privileged program. If the DAC mechanism is well implemented, no user except 
root can use unprivileged programs, like vi, to change the passwd file. So, we 
only focus on the behavior of privileged programs in verifying that the system 
satisfies the policy. 

Given a specific privileged program, we try to prove that
any audit trail that passes the check of its specification will satisfy the passwd
file access policy. We use the ftp daemon as an example to show how this works.

The proof obligation is stated as the theorem shown below. The formalizations
of the abstract system model sys and the audit data log are used in this theorem.
Note that some assumptions are added to complete the proof. Two
important verification assumptions are made in this proof.

The first assumption is about the access permissions of the passwd file. The 
passwd file can only be protected when it has proper access permission. If the 
passwd file is set world writeable, the integrity of the file cannot be protected 
because any user has the privilege to change the passwd file. 

The other assumption is concerned with the setting of the home directory of
the user who attempts to access the passwd file. A user can access the passwd
file if his home directory is set to /etc. The reason is that the specification of the
ftp daemon allows the user to access the files under his home directory. In fact,
this assumption can be guaranteed by deploying a configuration checking
tool such as KUANG [24]. But in SHIM, such a property of the system is not
monitored. With these assumptions, any audit data that passes the specification
check of the ftp daemon will satisfy the passwd file access policy.

(defthm passwd-ftp
  (implies
    (and (not (member '(/ etc passwd) created))
         ;passwd file was not created by the process
         (consp log) (consp sys) (logp log) (consp created) (sys-p sys)
         ;format checking
         (validuser sys log)
         ;assumption: no invalid user as determined by the audit data
         (passwdsafe log)
         ;assumption: passwd file has proper permissions
         (homedirsafe sys)
         ;assumption: home directory settings are correct
         (spec_ftpd sys log created))
    ;the specification is not violated by any operations
    (not-access-passwd log)
    ;then, the passwd access policy is satisfied
  ))

Using a similar method, we have proved that the specification of the lpd
program satisfies the passwd file access policy with the assumption that the
environment variable printerspool is not misconfigured. Changes to environment
variables are not monitored by SHIM, so this assumption clearly covers a
property that SHIM cannot check.

Composition of Specifications Satisfies the Policy. A further question is
whether the composition of different specifications will satisfy the passwd file
access policy. In this section, we consider concurrent execution of different
privileged programs. We use the ftp daemon and lpd as examples to show that the
composition of the specifications of these two programs satisfies the policy.




Fig. 3. Mechanism of SHIM to filter concurrent execution audit log 



As shown in Figure 3, in SHIM, the audit filter is used to separate the audit
trail of individual programs from the audit data of the system. We simulate the
filter using a function filter(prog, log) in ACL2, where prog is the name of the
program and log is the audit trail of the system. A question is whether the filter
will change the security property of the audit trail. If the filter maps the audit
trail of a few privileged programs to the audit trail of each program and all
of these sub-trails satisfy the passwd file access policy, does this mean that the
whole audit trail satisfies the policy?

Suppose log is the audit trail of ftpd and lpd. We have proved that if the
audit trail of ftpd, filter('ftpd, log), passes the specification check of ftpd and
if the audit trail of lpd, filter('lpd, log), passes the specification check of lpd,
then the audit trail of ftpd and lpd satisfies the passwd file access policy.

(defthm passwd-specs
  (implies
    (not (member '(/ etc passwd) created))
    ;passwd file was not created by the process
    (implies
      (and (logp log) (consp log) (consp sys) (sys-p sys) (procsafe log)
           ;format checking
           (passwdsafe log) (homedirsafe sys) (validuser sys log)
           ;assumptions for ftpd program
           (validenv sys 'printerspool)
           ;assumptions for lpd program
           (spec_ftpd sys (filter 'ftpd log) created)
           ;the specification of ftpd is not violated by any operations
           (spec_lpd sys (filter 'lpd log) created))
      ;the specification of lpd is not violated by any operations
      (not-access-passwd log))))
  ;then, the passwd access policy is satisfied

We notice that the assumptions in this verification are the union of the
assumptions in the two verifications that were proved previously. All theorems
appearing in this section have been proved automatically by the ACL2 theorem
prover using rewriting and mathematical induction.
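The shape of the composition argument can be sketched with a toy filter. This is a minimal Python illustration in which the two specifications are reduced to one-line stand-ins; the reading of procsafe as "every record belongs to ftpd or lpd" is an assumption, since the text does not define that predicate.

```python
WRITE_CALLS = {"open_wr", "creat", "chmod", "chown", "rename"}

def filter_log(prog, log):
    """Audit filter: keep only the records produced by one program."""
    return [rec for rec in log if rec["prog"] == prog]

def spec_ftpd(log):
    # Stand-in: ftpd may modify only its transfer log.
    return all(r["call"] not in WRITE_CALLS or r["path"] == "/var/log/xferlog" for r in log)

def spec_lpd(log):
    # Stand-in: lpd may modify only files under its spool directory.
    return all(r["call"] not in WRITE_CALLS or r["path"].startswith("/var/spool/lpd/") for r in log)

def procsafe(log):
    # Assumed meaning: the trail contains records of ftpd and lpd only.
    return all(r["prog"] in {"ftpd", "lpd"} for r in log)

def not_access_passwd(log):
    return not any(r["path"] == "/etc/passwd" and r["call"] in WRITE_CALLS for r in log)

def composition_holds(log):
    """Hypotheses of the composed theorem imply the policy, for this one trail."""
    hyps = (
        procsafe(log)
        and spec_ftpd(filter_log("ftpd", log))
        and spec_lpd(filter_log("lpd", log))
    )
    return (not hyps) or not_access_passwd(log)

mixed = [
    {"prog": "ftpd", "call": "open_wr", "path": "/var/log/xferlog"},
    {"prog": "lpd", "call": "creat", "path": "/var/spool/lpd/job1"},
]
stray = mixed + [{"prog": "vi", "call": "open_wr", "path": "/etc/passwd"}]
```

The trail `stray` shows why a hypothesis about which processes appear in the log is needed: both filtered sub-trails still pass their specifications, yet the whole trail touches /etc/passwd through a record that neither filter selects.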

4.4 Performance 

We measured the performance of ACL2 in carrying out the proofs described above.
We formalized the abstract system model, detection rules and security policies
with 174 functions and 13 data structures. We defined and proved 56 lemmas
and theorems to complete the verification. It took three weeks to develop all the
functions and complete the verification. On a 450 MHz Pentium machine with
384 MB of memory, ACL2 took 15.21 minutes to complete the verification. This
suggests that using ACL2 to formalize and verify security properties of IDSs is
a feasible approach.

5 Discussion 

The assumptions of the verification process may, in some cases, be guaranteed
through other tools. In the example verification, we introduced assumptions
needed to satisfy the passwd file access policy. These assumptions relate to access
permissions of target objects (e.g., the passwd file cannot be world-writable),
proper configurations (e.g., home directories of users cannot be /etc/), etc. SHIM
is not capable of monitoring these static properties of the system. But these
assumptions can be checked by deploying other security tools such as Tripwire
[11] [25].

In our verification, the soundness and completeness of detection rules of IDSs 
are not yet completely proved. If the soundness of the detection rules could be 
verified, the false positive rate of IDSs would theoretically be proved to be zero. 
In SHIM, the detection rules are specifications of the system. It is feasible, in 
principle, to prove the soundness of specifications by comparing the specifications 
with the implementation of the system. Automatic generation and verification of 
specifications can be achieved by associating formal methods with code analysis. 
As an extreme but practically useless example, it is easy to prove a specification 
rejecting all possible behaviors is sound. Considering the huge false negative 
rate, this specification is clearly not an acceptable solution even with a zero 
false positive rate. If the completeness of detection rules can be verified, the 
false negative rate of IDSs will be zero. Similarly, a specification accepting all 
behaviors can be proved complete. 

The ACL2 theorem prover is used in our verification. It provides reliable
verification by using well-accepted deduction rules, e.g., mathematical induction.
By describing properties of attacks, we can prove that all attacks (including
known attacks and unknown attacks) with specific operations (e.g. invoking a
shell) can be detected by SHIM. This verifies an important and often-cited claim
of specification-based intrusion detection: detection of unknown attacks. There
are a few limitations of mechanical theorem provers. First, creating a proof
of almost any practical property in a theorem prover is not totally
automatic. Although theorem provers help find missing steps in proofs, it is still
impossible for a theorem prover to create such proofs without human interaction.
Second, even if a proposition cannot be proved by a theorem prover, this does not
indicate that the proposition is false. It is also difficult to find a counter-example
that shows the conditions under which a property fails.

6 Conclusions and Future Work 

In this paper, we present a formal framework that can be used to evaluate 
detection rules of IDSs. ACL2 is used to formalize declarative components of the 
framework and to carry out the verifications. An abstract system model is built 
as the basis for verifications. Trusted file access policies are developed to define 
authorized access on security-critical objects of a system. We also report on our 
experience with a preliminary implementation of this framework in reasoning 
about security properties of SHIM, a specification-based IDS. We have formalized 
all detection rules of SHIM, specifications for privileged programs, and addressed 
two important issues about SHIM (and specification-based IDS, in general): what 
attacks can be detected by SHIM and whether abstract security policies can be 
satisfied by SHIM. An important feature of SHIM, its ability to detect unknown 
attacks, is actually verified by specifying properties of attacks. 

Potential future work includes analyzing misuse detection systems (i.e.
signature-based IDSs) and network IDSs; generating specifications using code analysis;
verifying soundness of specifications; and developing realistic security policies for
network protocols.

Abstract. As the frequency of attacks faced by the average host connected to the
Internet increases, reliance on manual intervention for response is decreasingly 
tenable. Operating system and application based mechanisms for automated re- 
sponse are increasingly needed. Existing solutions have either been customized 
to specific attacks, such as disabling an account after a number of authentication 
failures, or utilize harsh measures, such as shutting the system down. In contrast, 
we present a framework for systematic fine grained response that is achieved by 
dynamically controlling the host’s exposure to perceived threats. 

This paper introduces a formal model to characterize the risk faced by a host. It 
also describes how the risk can be managed in real-time by adapting the exposure. 

This is achieved by modifying the access control subsystem to let the choice 
of whether to grant a permission be delegated to code that is customized to the 
specific right. The code can then use the runtime context to make a more informed 
choice, thereby tightening access to a resource when a threat is detected. The 
running time can be constrained to provide performance guarantees. 

The framework was implemented by modifying the Java Runtime. A suite of 
vulnerable Jigsaw servlets and corresponding attacks was created. The follow- 
ing were manually added: code for dynamic permission checks; estimates of the 
reduction in exposure associated with each check; the frequencies with which in- 
dividual permissions occurred in a typical workload; a global risk tolerance. The 
resulting platform disrupted the attacks by denying the permissions needed for 
their completion. 

1 Introduction 

This paper presents a new method of intrusion prevention. We introduce a mechanism 
to dynamically alter the exposure of a host to contain an intrusion when it occurs. A 
host’s exposure comprises the set of exposures of all its resources. If access to a resource
is to be controlled, then a permission check will be present to safeguard it. The set of 
permissions that are utilized in the process of an intrusion occurring can thus be viewed 
as the system’s exposure to that particular threat. 

By performing auxiliary checks prior to granting a permission, the chance of it being 
granted in the presence of a threat can be reduced. By tightening the access control 
configuration, the system’s exposure can be reduced. By relaxing the configuration, the 
exposure can be allowed to increase. The use of auxiliary checks will introduce runtime 
overhead. In addition, when permissions are denied, applications may be prevented 
from functioning correctly. These two factors require that the use of the auxiliary checks 
must be minimized.

We first investigate how the auxiliary checks can be performed by modifying the
access control subsystem. After that we introduce a model for measuring and managing 
the risk by dynamically altering the host’s exposure. Finally, we demonstrate how the 
approach can be used to contain attacks in real-time. 

2 Predicated Permissions 

One approach is to use a subset of the security policy that can be framed intuitively. 
While this method suffers from the fact that the resulting specihcation will not be com- 
plete, it has the beneht that it is likely to be deployed. The specific subset we consider is 
that which constitutes the authorization policy. These consist of statements of the form 
{a ^ p). Here p is a permission and cr can be any legal statement in the policy, L. If 
cr holds true, then the permission p can be granted. The reference monitor maintains an 
access control matrix, M, which represents the space of all combinations of the set of 
subjects, S, set of objects, O, and the set of authorization types, A. 

M = S X O X A, where p{i,j,k) £ M (1) 

Traditionally, the space M is populated with elements of the form: 

= l (2) 

if the subject S'[i] should be granted permission A[k] to access object 0[j], and other- 
wise with: 

p{i,j,k) = 0 (3) 

In our new paradigm, we can replace the elements of M with ones of the form: 

p{i,j,k) = a, where a £ L (4) 

Thus, a permission check can be the evaluation of a predicate framed in a suitable 
language, $L$, which will be required to evaluate to either true or false, corresponding to 
1 or 0, instead of being a lookup of a binary value in a static configuration. 
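
The idea can be illustrated with a minimal Java sketch, written against a modern JDK rather than the JDK 1.4 of the prototype. The class `PredicatedMatrix` and its method names are hypothetical and are not taken from the paper. Each cell of the matrix holds a predicate instead of a bit, so a static grant is simply the constant-true predicate and a missing cell behaves as a denial.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Sketch of an access control matrix whose cells hold predicates rather than bits. */
class PredicatedMatrix {
    // The key "subject|object|authorization" identifies the cell p(i,j,k).
    private final Map<String, BooleanSupplier> cells = new HashMap<>();

    private static String key(String subject, String object, String auth) {
        return subject + "|" + object + "|" + auth;
    }

    /** Static grant, as in Equation 2: the stored predicate is constantly true. */
    void grant(String subject, String object, String auth) {
        cells.put(key(subject, object, auth), () -> true);
    }

    /** Predicated grant, as in Equation 4: sigma is evaluated on every check. */
    void grantIf(String subject, String object, String auth, BooleanSupplier sigma) {
        cells.put(key(subject, object, auth), sigma);
    }

    /** Absent cells behave like Equation 3 (p = 0, deny). */
    boolean check(String subject, String object, String auth) {
        BooleanSupplier sigma = cells.get(key(subject, object, auth));
        return sigma != null && sigma.getAsBoolean();
    }
}
```

Because the predicate is re-evaluated on each call to `check`, a change in runtime context takes effect without editing the matrix itself.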

3 Active Monitoring 

To realize our model of evaluating predicates prior to granting permissions, we augment 
a conventional access control subsystem by interceding on all permission checks and 
transferring control to our ActiveMonitor as shown in Figure 1.* If an appropriate 
binding exists, it delegates the decision to code customized to the specific right. Such 
bindings can be dynamically added to and removed from the running ActiveMonitor through 
a programming interface. This allows the restrictiveness of the system’s access control 
configuration to be continuously varied in response to changes in the threat level. 

Our prototype was created by modifying the runtime environment of Sun’s Java De- 
velopment Kit (JDK 1.4), which runs on the included stack-based virtual machine. The 
runtime includes a reference monitor, called the AccessController, which we altered 
as described below. 

* Interceding alone (without counting the effect of evaluating predicates) does not 
affect the running time of SPECjvm98 [SPECjvm98] with any statistical significance. 

Fig. 1. Static permission lookups are augmented using an ActiveMonitor which facilitates the 
use of runtime context in deciding whether to grant a permission. ActiveMonitor predicated 
permissions have 3 distinguishing features: (i) Constant running time, (ii) Dynamic activation if 
expected benefit exceeds cost, (iii) Interrogatable for cause of denial. 



3.1 Interposition 

When an application is executed, each method that is invoked causes a new frame to be 
pushed onto the stack. Each frame has its own access control context that encapsulates 
the permissions granted to it. When access to a controlled resource is made, the call 
through which it is made invokes the AccessController’s checkPermission() method. 
This inspects the stack and checks if any of the frames’ access control contexts contain 
permissions that would allow the access to be made. If it finds an appropriate permission 
it returns silently. Otherwise it throws an exception of type AccessControlException. 
See [Koved98] for details. 

We altered the checkPermission() method so it first calls the ActiveMonitor’s 
checkPermission() method. If it returns with a null value, the AccessController’s 
checkPermission() logic executes and completes as it would have without modification. 
Otherwise, the return value is used to throw a customized subclass of 
AccessControlException which includes information about the reason why the permission was 
denied. Thus, the addition of the ActiveMonitor functionality can restrict the 
permissions, but it cannot cause new permissions to be granted. Note that it is necessary to 
invoke the ActiveMonitor’s checkPermission() first since the side-effect of invoking this 
method may be the initiation of an exposure-reducing response. If it were invoked after 
the AccessController’s checkPermission(), then in the cases that an 
AccessControlException was thrown, control would not flow to the ActiveMonitor’s 
checkPermission(), leaving any side-effect responses uninitiated. 

Code that is invoked by the ActiveMonitor should not itself cause new ActiveMonitor 
calls, since this could result in a recursive loop. To avoid this, before the 
ActiveMonitor’s checkPermission() method is invoked, the stack is traversed to ensure that 
none of the frames is an ActiveMonitor frame, since that would imply that the current 
thread belonged to code invoked by the ActiveMonitor. If an ActiveMonitor frame is 
found, the AccessController’s checkPermission() returns silently, that is, it grants the 
permission with no further checks. 
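
The ordering and the recursion guard described above can be sketched as follows. This is an illustration under stated assumptions, not the modified JDK code: `InterposedController` is a hypothetical class, the monitor is modeled as a function returning null (defer) or a reason string (deny), `SecurityException` stands in for the customized exception, and a per-thread flag replaces the stack traversal used in the prototype.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch of a reference monitor that consults an active monitor before its static policy. */
class InterposedController {
    // Hypothetical stand-in for the static policy: permission name -> granted.
    private final Map<String, Boolean> staticPolicy = new HashMap<>();
    // Hypothetical stand-in for the ActiveMonitor: null means defer, a string means deny.
    private final Function<String, String> activeMonitor;
    // Per-thread flag used here in place of the stack traversal described in the text.
    private final ThreadLocal<Boolean> insideMonitor = ThreadLocal.withInitial(() -> false);

    InterposedController(Function<String, String> activeMonitor) {
        this.activeMonitor = activeMonitor;
    }

    void allow(String permission) {
        staticPolicy.put(permission, true);
    }

    void checkPermission(String permission) {
        if (insideMonitor.get()) {
            return; // request comes from monitor code: grant with no further checks
        }
        String reason;
        insideMonitor.set(true);
        try {
            // Consulted first, so any side-effect response is always initiated.
            reason = activeMonitor.apply(permission);
        } finally {
            insideMonitor.set(false);
        }
        if (reason != null) {
            throw new SecurityException("denied by monitor: " + reason);
        }
        if (!staticPolicy.getOrDefault(permission, false)) {
            throw new SecurityException("denied by static policy: " + permission);
        }
    }
}
```

Since the monitor can only return a denial reason or defer, this structure can restrict the static policy but never widen it. The per-thread flag only covers monitor code that runs on the calling thread, which is a simplification of the prototype's behavior.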

3.2 Invocation 

When the system initializes, the ActiveMonitor first creates a hash table which maps 
permissions to predicates. It populates this by loading the relevant classes, using Java 
Reflection to obtain appropriate constructors and storing them for subsequent invoca- 
tion. At this point it is ready to accept delegations from the AccessController. 

When the ActiveMonitor’s checkPermission() method is invoked, it uses the permission 
passed as a parameter to perform a lookup and extract any code associated with the 
permission. If code is found, it is invoked in a new thread and a timer is started. Otherwise, 
the method returns null, indicating that the AccessController should use the static 
configuration to decide if the permission should be granted. The code must be a subclass of 
the abstract class PredicateThread. A skeletal version is presented in Figure 2. This 
ensures that it will store the result in a shared location when the thread completes and 
notify the ActiveMonitor of its completion via a shared synchronization lock. 

The shared location is inspected when the timer expires. If the code that was run 
evaluated to true, then a null is returned by the ActiveMonitor’s checkPermission() 
method. Otherwise a string describing the cause of the permission denial is returned. 



public abstract class PredicateThread extends Thread {

    protected PredicateThread(Permission permission,
                              Object lock);

    public void run() {
        if (condition) result = true;
        synchronized (lock) {
            lock.notify();
        }
    }

    public boolean getResult();
}

Fig. 2. Skeletal version of PredicateThread 

If the code had not finished executing when the timer expired, a string denoting this 
is returned. As described above, when a string is returned, it is used by the modified 
AccessController to throw an ActiveMonitorException, our customized subclass of 
AccessControlException, which includes information about the predicate that failed. 
The thread forked to evaluate code can be destroyed once its timer expires. Care must be 
taken when designing predicates so that their destruction midway through an evaluation 
does not affect subsequent evaluations. 

Finally, the ActiveMonitor’s own configuration can be dynamically altered. It ex- 
poses enableSafeguard( ) and disableSafeguard( ) methods for this. These can be used 
to activate and deactivate the utilization of the auxiliary checks for a specific permis- 
sion. If a piece of code is being evaluated prior to granting a particular permission 
and there is no longer any need for this to occur, it can be deactivated with the dis- 
ableSafeguard( ) method. Subsequently that permission will be granted using only the 
AccessController’s static configuration using a lookup of a binary value. Similarly, if 
it is deemed necessary to perform extra checks prior to granting a permission, this may 
be enabled by invoking the enableSafeguard( ) method. 
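
A compact sketch of the invocation path follows. `TimedMonitor` is a hypothetical class and its `enableSafeguard` signature is an assumption made for illustration. It departs from the description above in one respect: it returns as soon as the predicate thread finishes instead of always waiting for the timer, so it does not reproduce the constant running time of the prototype.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

/** Sketch of a monitor that evaluates a predicate in a separate thread under a timeout. */
class TimedMonitor {
    private final Map<String, BooleanSupplier> predicates = new ConcurrentHashMap<>();
    private final long timeoutMillis; // must be positive; join(0) would wait forever

    TimedMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void enableSafeguard(String permission, BooleanSupplier predicate) {
        predicates.put(permission, predicate);
    }

    void disableSafeguard(String permission) {
        predicates.remove(permission);
    }

    /** Returns null to defer to the static configuration, otherwise a denial reason. */
    String checkPermission(String permission) {
        BooleanSupplier predicate = predicates.get(permission);
        if (predicate == null) {
            return null; // no safeguard active for this permission
        }
        final boolean[] result = { false }; // shared location written by the worker
        Thread worker = new Thread(() -> result[0] = predicate.getAsBoolean());
        worker.setDaemon(true);
        worker.start();
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted while evaluating predicate for " + permission;
        }
        if (worker.isAlive()) {
            worker.interrupt(); // ask the unfinished predicate to stop
            return "predicate for " + permission + " timed out";
        }
        return result[0] ? null : "predicate for " + permission + " evaluated to false";
    }
}
```

A caller would turn a non-null return value into an exception, in the way the modified AccessController does with ActiveMonitorException.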

4 Risk 

Given the ability to predicate permissions on the successful verification of auxiliary 
conditions, we now consider the problem of how to choose when to use such safeguards. 

The primary goal of an intrusion response system is to guard against attacks. However, 
invoking responses arbitrarily may safeguard part of the system but leave other weaker 
areas exposed. Thus, to effect a rational response, it is necessary to weigh all the 
possible alternatives. A course of action must then be chosen which will result in the least 
damage, while simultaneously assuring that cost constraints are respected. Risk 
management addresses this problem. 

4.1 Risk Factors 

Analyzing the risk of a system requires knowledge of a number of factors. Below we 
describe each of these factors along with its associated semantics. We define these in the 
context of the operating system paradigm since our goal is host-based response. 

The paradigm assumes the existence of an operating system augmented with an 
access control subsystem that mediates access by subjects to objects in the system using 
predicated permissions. In addition, a host-based intrusion detection system is assumed 
to be present and operational. 

Fig. 3. Risk can be analyzed as a function of the threats, their likelihood, the 
vulnerabilities, the safeguards, the assets and the consequences. Risk can be managed by 
using safeguards to control the exposure of vulnerable resources. 

Threats. A threat is an agent that can cause harm to an asset in the system. We define a 
threat to be a specific attack against any of the application or system software that is 
running on the host. It is characterized by an intrusion detection signature. The set 
of threats is denoted by $T = \{t_1, t_2, \ldots\}$, where $t_\alpha \in T$ is an intrusion 
detection signature. Since $t_\alpha$ is a host-based signature, it consists of an ordered set 
of events $S(t_\alpha) = \{s_1, s_2, \ldots\}$. If this set occurs in the order recognized by the 
rules of the intrusion detector, it signifies the presence of an attack. 

Likelihood. The likelihood of a threat is the hypothetical probability of it occurring. If 
a signature has been partially matched, the extent of the match serves as a predictor 
of the chance that it will subsequently be completely matched. A function $\mu$ is used 
to compute the likelihood of threat $t_\alpha$. $\mu$ can be threat specific and will depend on 
the history of system events that are relevant to the intrusion signature. Thus, if 
$E = \{e_1, e_2, \ldots\}$ denotes the ordered set of all events that have occurred, then: 

$$\mathcal{T}(t_\alpha) = \mu(t_\alpha, E \cap S(t_\alpha)) \tag{5}$$

where $\cap$ yields the set of all events that occur in the same order in each input set. 
Our implementation of $\mu$ is described in Section 7.1. 

Assets. An asset is an item that has value. We define the assets to be the data stored in 
the system. In particular, each file is considered a separate object $o_\beta \in O$, where 
$O = \{o_1, o_2, \ldots\}$ is the set of assets. A set of objects $A(t_\alpha) \subseteq O$ is 
associated with each threat $t_\alpha$. Only objects $o_\beta \in A(t_\alpha)$ can be harmed if 
the attack that is characterized by $t_\alpha$ succeeds. 

Consequences. A consequence is a type of harm that an asset may suffer. Three types 
of consequences can impact the data. These are the loss of confidentiality, integrity 
and availability. If an object $o_\beta \in A(t_\alpha)$ is affected by the threat $t_\alpha$, then 
the resulting costs due to the loss of confidentiality, integrity and availability are denoted 
by $c(o_\beta)$, $i(o_\beta)$, and $a(o_\beta)$ respectively. Any of these values may be 0 if the 
attack cannot effect the relevant consequence. However, all three values associated with 
a single object cannot be 0 since in that case $o_\beta \in A(t_\alpha)$ would not hold. Thus, 
the consequence of a threat $t_\alpha$ is: 

$$\mathcal{C}(t_\alpha) = \sum_{o_\beta \in A(t_\alpha)} c(o_\beta) + i(o_\beta) + a(o_\beta) \tag{6}$$

Vulnerabilities. A vulnerability is a weakness in the system. It results from an error 
in the design, implementation or configuration of either the operating system or 
application software. The set of vulnerabilities present in the system is denoted by 
$W = \{w_1, w_2, \ldots\}$. $W(t_\alpha) \subseteq W$ is the set of weaknesses exploited by the 
threat $t_\alpha$ to subvert the security policy. 

Safeguards. A safeguard is a mechanism that controls the exposure of the system’s 
assets. The reference monitor’s set of permission checks $P = \{p_1, p_2, \ldots\}$ serve as 
safeguards in an operating system. Since the reference monitor mediates access to 
all objects, a vulnerability’s exposure can be limited by denying the relevant 
permissions. The set $P(w_\gamma) \subseteq P$ contains all the permissions that are requested in the 
process of exploiting vulnerability $w_\gamma$. The static configuration of a conventional 
reference monitor either grants or denies access to a permission $p_\lambda$. This exposure 
is denoted by $v(p_\lambda)$, with the value being either 0 or 1. The active reference 
monitor can reduce the exposure of a statically granted permission to $v'(p_\lambda)$, a value in 
the range $[0,1]$. This reflects the nuance that results from evaluating predicates as 
auxiliary safeguards. 

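Thus, if all auxiliary safeguards are utilized, the total exposure to a threat $t_\alpha$ is: 

$$\mathcal{V}(t_\alpha) = \sum_{p_\lambda \in P(t_\alpha)} \frac{v(p_\lambda) \times v'(p_\lambda)}{|P(t_\alpha)|} \tag{7}$$

where: 

$$P(t_\alpha) = \bigcup_{w_\gamma \in W(t_\alpha)} P(w_\gamma) \tag{8}$$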


5 Runtime Risk Management 

The risk to the host is the sum of the risks that result from each of the threats that it 
faces. The risk from a single threat is the product of the chance that the attack will 
occur, the exposure of the system to the attack, and the cost of the consequences of the 
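attack succeeding [NIST800-12]. Thus, the cumulative risk faced by the system is: 

$$\mathcal{R} = \sum_{t_\alpha \in T} \mathcal{T}(t_\alpha) \times \mathcal{V}(t_\alpha) \times \mathcal{C}(t_\alpha) \tag{9}$$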

If the risk posed to the system is to be managed, the current level must be contin- 
uously monitored. When the risk rises past the threshold that the host can tolerate, the 
system’s security must be tightened. Similarly, when the risk decreases, the restrictions 
can be relaxed to improve performance and usability. This process is elucidated below. 

The system’s risk can be reduced by reducing the exposure of vulnerabilities. This 
is effected through the use of auxiliary safeguards prior to granting a permission. 
Similarly, if the threat recedes, the restrictive permission checks can be relaxed. 
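
As a rough illustration of how these quantities combine, the following Java sketch computes the exposure of Equation 7 for each threat and then sums likelihood, exposure and consequence over all threats. The classes `Threat` and `RiskModel` and their field names are hypothetical, and the numeric values in the usage note are invented purely for the arithmetic.

```java
import java.util.List;

/** Sketch of the per-threat factors that enter the risk calculation. */
class Threat {
    final double likelihood;        // likelihood of the threat, in [0,1]
    final double consequence;       // summed confidentiality, integrity, availability costs
    final double[] staticExposure;  // v(p) for each permission in P(t), either 0 or 1
    final double[] guardedExposure; // v'(p) for each permission in P(t), in [0,1]

    Threat(double likelihood, double consequence,
           double[] staticExposure, double[] guardedExposure) {
        this.likelihood = likelihood;
        this.consequence = consequence;
        this.staticExposure = staticExposure;
        this.guardedExposure = guardedExposure;
    }

    /** Equation 7: average of v(p) * v'(p) over the permissions in P(t). */
    double exposure() {
        double sum = 0;
        for (int k = 0; k < staticExposure.length; k++) {
            sum += staticExposure[k] * guardedExposure[k];
        }
        return sum / staticExposure.length;
    }
}

class RiskModel {
    /** Cumulative risk: sum over threats of likelihood * exposure * consequence. */
    static double risk(List<Threat> threats) {
        double total = 0;
        for (Threat t : threats) {
            total += t.likelihood * t.exposure() * t.consequence;
        }
        return total;
    }
}
```

For example, a threat with likelihood 0.5, consequence 40 and two permissions with guarded exposures 0.5 and 1.0 has exposure 0.75 and contributes 0.5 × 0.75 × 40 = 15 to the total.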

5.1 Managed Risk 

The set of permissions $P$ is kept partitioned into two disjoint sets, $\Psi(P)$ and 
$\Omega(P)$, that is $\Psi(P) \cap \Omega(P) = \emptyset$ and $\Psi(P) \cup \Omega(P) = P$. The set 
$\Psi(P) \subseteq P$ contains the permissions for which auxiliary safeguards are currently 
active. The remaining permissions $\Omega(P) \subseteq P$ are handled conventionally by the 
reference monitor, using only static lookups rather than evaluating associated predicates 
prior to granting these permissions. 

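At any given point, when the set of safeguards $\Psi(P)$ is in use, the current risk 
$\mathcal{R}'$ is calculated with: 

$$\mathcal{R}' = \sum_{t_\alpha \in T} \mathcal{T}(t_\alpha) \times \mathcal{V}'(t_\alpha) \times \mathcal{C}(t_\alpha) \tag{10}$$

where: 

$$\mathcal{V}'(t_\alpha) = \sum_{p_\lambda \in P(t_\alpha) \cap \Omega(P)} \frac{v(p_\lambda)}{|P(t_\alpha)|} + \sum_{p_\lambda \in P(t_\alpha) \cap \Psi(P)} \frac{v(p_\lambda) \times v'(p_\lambda)}{|P(t_\alpha)|} \tag{11}$$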



5.2 Risk Tolerance 

While the risk must be monitored continuously, there is a computational cost incurred 
each time it is recalculated. Therefore, the frequency with which the risk is estimated 
must be minimized to the extent possible. Instead of calculating the risk synchronously 
at fixed intervals in time, we exploit the fact that the risk level only changes when the 
threat to the system is altered. 

An intrusion detector is assumed to be monitoring the system’s activity. Each time 
it detects an event that changes the extent to which a signature has been matched, it 
passes the event $e$ to the intrusion response subsystem. The level of risk $\mathcal{R}_b$ 
before $e$ occurred is noted, and then the level of risk $\mathcal{R}_a$ after $e$ occurred is 
calculated. Thus, $\mathcal{R}_a = \mathcal{R}_b + \epsilon$, where $\epsilon$ denotes the change in 
the risk. Since the risk is recalculated only when it actually changes, the computational 
cost of monitoring it is minimized. 

Each time an event $e$ occurs, either the risk decreases, stays the same or increases. 
Each host is configured to tolerate risk up to a threshold, denoted by $\mathcal{R}_0$. After 
each event $e$, the system’s response guarantees that the risk will return to a level below 
this threshold. As a result, $\mathcal{R}_b < \mathcal{R}_0$ always holds. If $\epsilon = 0$, then no 
further risk management steps are required. 

If $\epsilon < 0$, then $\mathcal{R}_a < \mathcal{R}_0$ since 
$\mathcal{R}_a = \mathcal{R}_b + \epsilon < \mathcal{R}_b < \mathcal{R}_0$. At this point, the system’s 
security configuration is more restrictive than it needs to be. To improve system 
usability and performance, the response system must deactivate appropriate safeguards, 
while ensuring that the risk level does not rise past the threshold $\mathcal{R}_0$. 

If $\epsilon > 0$ and $\mathcal{R}_a < \mathcal{R}_0$, then no action needs to be taken. Even though 
the risk has increased, it is below the threshold that the system can tolerate, so no further 
safeguards need to be introduced. In addition, the system will not be able to find any set of 
unused safeguards whose removal will increase the risk by less than 
$\mathcal{R}_0 - \mathcal{R}_b - \epsilon$, since the presence of such a combination would also mean 
that the set existed before $e$ occurred. It is not possible that such a combination of 
safeguards existed before $e$ occurred since they would also have satisfied the condition of 
being less than $\mathcal{R}_0 - \mathcal{R}_b$ and would have been utilized before $e$ occurred in 
the process of minimizing the impact on performance in the previous step. 

If $\epsilon > 0$ and $\mathcal{R}_a > \mathcal{R}_0$, then action is required to reduce the risk to 
a level below the threshold of tolerance. The response system must search for and implement 
a set of safeguards to this end. Since the severity of the response is dependent on the 
current risk level, the risk recalculation cannot be delayed despite the additional overhead 
it imposes at a point when the system is already stressed. 
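
The case analysis above reduces to a small decision function, sketched here in Java. The class `RiskTolerance` and the `Action` names are hypothetical labels chosen for illustration, and the sketch only classifies the required reaction without selecting any safeguards.

```java
/** Sketch of the reaction to a change in risk caused by an event. */
class RiskTolerance {
    enum Action { NONE, RELAX_SAFEGUARDS, ADD_SAFEGUARDS }

    /**
     * riskBefore is the risk before the event, delta the change in risk,
     * threshold the tolerated risk level.
     */
    static Action react(double riskBefore, double delta, double threshold) {
        double riskAfter = riskBefore + delta;
        if (delta == 0) {
            return Action.NONE;             // nothing changed
        }
        if (delta < 0) {
            return Action.RELAX_SAFEGUARDS; // configuration is now stricter than needed
        }
        // The risk increased: act only if the threshold has been crossed.
        return riskAfter > threshold ? Action.ADD_SAFEGUARDS : Action.NONE;
    }
}
```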
5.3 Recalculating Risk 

When the risk is calculated the first time, Equation 9 is used. Therefore, the cost is 
$O(|T| \times |P| \times |O|)$. Since the change in the risk must be repeatedly evaluated during 
real-time reconfiguration of the runtime environment, it is imperative that the cost is 
minimized. This is achieved by caching all the values 
$\mathcal{V}'(t_\alpha) \times \mathcal{C}(t_\alpha)$ associated with threats $t_\alpha \in T$ 
during the evaluation of Equation 9. Subsequently, when an event $e$ occurs, the change 
in the risk $\epsilon = \delta(\mathcal{R}', e)$ can be calculated with cost $O(|T|)$ as described 
below. 

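The ordered set $E$ refers to all the events that have occurred in the system prior to 
the event $e$. The change in the likelihood of a threat $t_\alpha$ due to $e$ is: 

$$\delta(\mathcal{T}(t_\alpha), e) = \mu(t_\alpha, (E \cup e) \cap S(t_\alpha)) - \mu(t_\alpha, E \cap S(t_\alpha)) \tag{12}$$

The set of threats affected by $e$ is denoted by $\Delta(T, e)$. A threat 
$t_\alpha \in \Delta(T, e)$ is considered to be affected by $e$ if 
$\delta(\mathcal{T}(t_\alpha), e) \neq 0$, that is, its likelihood changed due to the 
event $e$. The resultant change in the risk level is: 

$$\delta(\mathcal{R}', e) = \sum_{t_\alpha \in \Delta(T, e)} \delta(\mathcal{T}(t_\alpha), e) \times \mathcal{V}'(t_\alpha) \times \mathcal{C}(t_\alpha) \tag{13}$$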
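
The caching scheme can be sketched as follows. `IncrementalRisk` and its methods are hypothetical names, and threats are identified by plain strings for brevity. The product of exposure and consequence is stored once per threat, so an event only costs one multiplication per affected threat.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of incremental risk maintenance using cached exposure * consequence products. */
class IncrementalRisk {
    private final Map<String, Double> likelihood = new HashMap<>();   // current likelihood per threat
    private final Map<String, Double> cachedWeight = new HashMap<>(); // exposure * consequence per threat
    private double risk = 0;

    /** Registers a threat and adds its contribution to the initial risk. */
    void addThreat(String id, double initialLikelihood, double exposure, double consequence) {
        likelihood.put(id, initialLikelihood);
        cachedWeight.put(id, exposure * consequence);
        risk += initialLikelihood * exposure * consequence;
    }

    /**
     * Applies an event, given the new likelihoods of the threats it affects.
     * Returns the resulting change in risk, in the spirit of Equation 13.
     */
    double onEvent(Map<String, Double> newLikelihoods) {
        double delta = 0;
        for (Map.Entry<String, Double> entry : newLikelihoods.entrySet()) {
            double change = entry.getValue() - likelihood.get(entry.getKey());
            delta += change * cachedWeight.get(entry.getKey());
            likelihood.put(entry.getKey(), entry.getValue());
        }
        risk += delta;
        return delta;
    }

    double risk() {
        return risk;
    }
}
```

The cached products would need to be refreshed whenever a safeguard is enabled or disabled, since that changes the exposure term. The sketch omits that step.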



6 Cost / Benefit Analysis 



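After an event $e$ occurs, if the risk level $\mathcal{R}_a$ increases past the threshold of 
risk tolerance $\mathcal{R}_0$, the goal of the response engine is to reduce the risk by 
$\delta_g > \mathcal{R}_a - \mathcal{R}_0$ to a level below the threshold. To do this, it must 
select a subset of permissions $\rho(\Omega(P)) \subseteq \Omega(P)$, such that adding the 
safeguards will reduce the risk to the desired level. By ensuring that the permissions in 
$\rho(\Omega(P))$ are granted only after relevant predicates are verified, the resulting risk 
level is reduced to: 

$$\mathcal{R}'' = \sum_{t_\alpha \in T} \mathcal{T}(t_\alpha) \times \mathcal{V}''(t_\alpha) \times \mathcal{C}(t_\alpha) \tag{14}$$

where the new vulnerability measure, based on Equation 7, is: 

$$\mathcal{V}''(t_\alpha) = \sum_{p_\lambda \in P(t_\alpha) \cap (\Omega(P) - \rho(\Omega(P)))} \frac{v(p_\lambda)}{|P(t_\alpha)|} + \sum_{p_\lambda \in P(t_\alpha) \cap (\Psi(P) \cup \rho(\Omega(P)))} \frac{v(p_\lambda) \times v'(p_\lambda)}{|P(t_\alpha)|} \tag{15}$$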

Instead, after an event $e$ occurs, if the risk level $\mathcal{R}_a$ decreases, the goal of 
the response engine is to allow the risk to rise by $\delta_g < \mathcal{R}_0 - \mathcal{R}_a$ to 
a level below the threshold of risk tolerance $\mathcal{R}_0$. To do this, it must select a 
subset of permissions $\rho(\Psi(P)) \subseteq \Psi(P)$, such that removing the safeguards 
currently in use for the set will yield the maximum improvement to runtime performance. 
After the safeguards are relaxed, the risk level will rise to: 

$$\mathcal{R}'' = \sum_{t_\alpha \in T} \mathcal{T}(t_\alpha) \times \mathcal{V}''(t_\alpha) \times \mathcal{C}(t_\alpha) \tag{16}$$

where the new vulnerability measure, based on Equation 7, is: 

$$\mathcal{V}''(t_\alpha) = \sum_{p_\lambda \in P(t_\alpha) \cap (\Omega(P) \cup \rho(\Psi(P)))} \frac{v(p_\lambda)}{|P(t_\alpha)|} + \sum_{p_\lambda \in P(t_\alpha) \cap (\Psi(P) - \rho(\Psi(P)))} \frac{v(p_\lambda) \times v'(p_\lambda)}{|P(t_\alpha)|} \tag{17}$$



There are exponentially many ways of choosing subsets $\rho(\Omega(P)) \subseteq \Omega(P)$ 
for risk reduction or subsets $\rho(\Psi(P)) \subseteq \Psi(P)$ for risk relaxation. When 
selecting from the possibilities, the primary objective is the maintenance of the bound 
$\mathcal{R}'' < \mathcal{R}_0$, where $\mathcal{R}'' = \mathcal{R}_a - \delta_g$ in the case of risk 
reduction, and $\mathcal{R}'' = \mathcal{R}_a + \delta_g$ in the case of risk relaxation. 

The choice of safeguards also impacts the performance of the system. Evaluating 
predicates prior to granting permissions introduces latency in system calls. A single 
interrogation of the runtime, such as checking how much swap space is free, takes 
about 1 ms. When file permission checks were protected with safeguard code that ran 
for 150 ms, 5 of 7 applications in [SPECjvm98] took less than 2% longer to run on 
average, while the other 2 applications took 37% longer. Hence, the choice of subsets 
$\rho(\Omega(P))$ or $\rho(\Psi(P))$ is subject to the secondary goal of minimizing the overhead 
introduced. (In practice, the cumulative effect is likely to be acceptable since useful 
predicate functionality can be created with code that runs in just a few milliseconds.) 

The adverse impact of a safeguard is proportional to the frequency with which it is 
utilized in the system’s workload. Given a typical workload, we can count the frequency 
$f(p_\lambda)$ with which permission $p_\lambda$ is requested in the workload. This can be done for 
all permissions. The cost of utilizing subset $\rho(\Omega(P))$ for risk reduction can then be 
calculated with: 

$$\mathcal{C}(\rho(\Omega(P))) = \sum_{p_\lambda \in \rho(\Omega(P))} f(p_\lambda) \tag{18}$$

Similarly, if the safeguards of subset $\rho(\Psi(P))$ are relaxed, the resulting reduction in 
runtime cost can be calculated with: 

$$\mathcal{C}(\rho(\Psi(P))) = \sum_{p_\lambda \in \rho(\Psi(P))} f(p_\lambda) \tag{19}$$

The ideal choice of safeguards will minimize the impact on performance, while 
simultaneously ensuring that the risk remains below the threshold of tolerance. Thus, 
for risk reduction we wish to find: 

$$\min \mathcal{C}(\rho(\Omega(P))), \quad \mathcal{R}'' < \mathcal{R}_0 \tag{20}$$

In the context of risk relaxation, we wish to find: 

$$\max \mathcal{C}(\rho(\Psi(P))), \quad \mathcal{R}'' < \mathcal{R}_0 \tag{21}$$

Both these problems are equivalent to the NP-complete 0-1 Knapsack Problem. 
Although approximation algorithms exist [Kellerer98], they are not suitable for our use 
since we need to make a choice in real-time. Instead, we will use a heuristic which 
guarantees that the risk is maintained below the threshold. The heuristic is based on the 
greedy algorithm for the 0-1 Knapsack Problem which picks the item with the highest 
benefit-to-cost ratio repeatedly till the knapsack’s capacity is reached. This yields a 
solution that is always within a factor of 2 of the optimal choice [Garey79]. 



6.1 Response Heuristic 

When the risk needs to be reduced, the heuristic uses the greedy strategy of picking 
the response primitive with the highest benefit-to-cost ratio repeatedly till the constraint 
is satisfied. By maintaining the choices in a heap data structure keyed on the 
benefit-to-cost ratio, each primitive in the response set can be chosen in $O(1)$ time. This is 
significant since implementing a single response primitive is often sufficient for 
disrupting an attack in progress. When the risk needs to be relaxed, the active safeguards 
with the highest cost-to-benefit ratios can be selected since these will yield the best 
improvement to system performance. A separate heap is utilized to maintain these. 



Risk Reduction. We outline the algorithm for the case where the risk needs to be 
reduced. The first two steps constitute pre-processing and therefore only occur during 
system initialization. Risk relaxation is analogous and therefore not described explicitly. 

Step 1 The benefit-to-cost ratio of each candidate safeguard permission 
$p_\lambda \in \Omega(P)$ can be calculated by: 

$$\kappa(p_\lambda) = \frac{1}{f(p_\lambda)} \sum_{t_\alpha : p_\lambda \in (P(t_\alpha) \cap \Omega(P))} \mathcal{T}(t_\alpha) \times \frac{v(p_\lambda) \times (1 - v'(p_\lambda))}{|P(t_\alpha)|} \times \mathcal{C}(t_\alpha) \tag{22}$$

Step 2 The response set is defined as empty, that is $\rho(\Omega(P)) = \emptyset$. 

Step 3 The single risk reducing measure with the highest benefit-to-cost ratio can be 
selected, that is: 

$$p_{max} = \max \kappa(p_\lambda), \quad p_\lambda \in \Omega(P) \tag{23}$$

The permission is added to $\rho(\Omega(P))$. 

Step 4 The risk before the candidate responses were utilized is $\mathcal{R}_a$. If the 
responses were activated the resulting risk $\mathcal{R}''$ is given by: 

$$\mathcal{R}'' = \mathcal{R}_a - \sum_{p_\lambda \in \rho(\Omega(P))} \kappa(p_\lambda) \times f(p_\lambda) \tag{24}$$

This is equivalent to using Equations 14 and 15. While the worst case complexity 
is the same, when few protective measures are added the cost of the above 
calculation is significantly lower. 

Step 5 If $\mathcal{R}'' > \mathcal{R}_0$ then the system repeats the above from Step 3 onwards. 
If $\mathcal{R}'' < \mathcal{R}_0$ then proceed to the next step. 

Step 6 The set of safeguards $\rho(\Omega(P))$ must be activated and $\rho(\Omega(P))$ should be 
transferred from $\Omega(P)$ to $\Psi(P)$. 

The time complexity is $O(|\rho(\Omega(P))|)$. In the worst case, this is 
$O(|\Omega(P)|) \leq O(|P|)$. Unless a large variety of attacks are simultaneously launched 
against the target, the response set will be small. 
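
The reduction steps can be sketched in Java using the standard library's binary heap. `Safeguard` and `GreedyReduction` are hypothetical names. The `benefit` field stands for the risk reduction obtained by enabling the safeguard, that is, the product of the benefit-to-cost ratio and the request frequency of the permission. The values are assumed to be precomputed, which corresponds to Step 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of a candidate safeguard with its precomputed benefit and cost. */
class Safeguard {
    final String permission;
    final double benefit;   // risk reduction if enabled
    final double frequency; // how often the permission is requested (the cost)

    Safeguard(String permission, double benefit, double frequency) {
        this.permission = permission;
        this.benefit = benefit;
        this.frequency = frequency;
    }

    double ratio() {
        return benefit / frequency; // benefit-to-cost ratio
    }
}

class GreedyReduction {
    /** Picks inactive safeguards by descending ratio until the risk no longer exceeds the threshold. */
    static List<Safeguard> select(List<Safeguard> inactive, double currentRisk, double threshold) {
        // Max-heap on the benefit-to-cost ratio.
        PriorityQueue<Safeguard> heap =
            new PriorityQueue<>((a, b) -> Double.compare(b.ratio(), a.ratio()));
        heap.addAll(inactive);
        List<Safeguard> chosen = new ArrayList<>(); // Step 2: empty response set
        double risk = currentRisk;
        while (risk > threshold && !heap.isEmpty()) {
            Safeguard best = heap.poll();           // Step 3: highest ratio
            chosen.add(best);
            risk -= best.benefit;                   // Step 4: updated risk
        }                                           // Step 5: repeat while above threshold
        return chosen;                              // Step 6 would activate these
    }
}
```

With a current risk of 28, a threshold of 20, and candidates A (benefit 6, frequency 2), B (benefit 4, frequency 4) and C (benefit 5, frequency 1), the sketch selects C and then A, leaving a risk of 17.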

Risk Relaxation. In the case of risk relaxation, the algorithm becomes: 

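Step 1 For $p_\lambda \in \Psi(P)$ calculate: 

$$\kappa(p_\lambda) = \frac{1}{f(p_\lambda)} \sum_{t_\alpha : p_\lambda \in (P(t_\alpha) \cap \Psi(P))} \mathcal{T}(t_\alpha) \times \frac{v(p_\lambda) \times (1 - v'(p_\lambda))}{|P(t_\alpha)|} \times \mathcal{C}(t_\alpha) \tag{25}$$

Step 2 Set $\rho(\Psi(P)) = \emptyset$. 

Step 3 Find the safeguard which yields the least risk reduction per instance of use: 

$$p_{min} = \min \kappa(p_\lambda), \quad p_\lambda \in \Psi(P) \tag{26}$$

Add it to $\rho(\Psi(P))$. 

Step 4 Calculate $\mathcal{R}''$: 

$$\mathcal{R}'' = \mathcal{R}_a + \sum_{p_\lambda \in \rho(\Psi(P))} \kappa(p_\lambda) \times f(p_\lambda) \tag{27}$$

Step 5 If $\mathcal{R}'' < \mathcal{R}_0$, repeat from Step 3. If $\mathcal{R}'' = \mathcal{R}_0$, 
proceed to the next step. If $\mathcal{R}'' > \mathcal{R}_0$, undo the last iteration of Step 3. 

Step 6 Relax all measures in $\rho(\Psi(P))$ and transfer them to $\Omega(P)$. 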
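
The relaxation steps admit a similar sketch, this time with a min-heap and an explicit undo of the last pick when the threshold would be exceeded. `ActiveSafeguard` and `GreedyRelaxation` are hypothetical names, and `riskIncrease` is assumed to hold the precomputed rise in risk caused by disabling the safeguard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of an active safeguard with the risk it currently removes and its cost. */
class ActiveSafeguard {
    final String permission;
    final double riskIncrease; // rise in risk if this safeguard is disabled
    final double frequency;    // how often the permission is requested (the cost)

    ActiveSafeguard(String permission, double riskIncrease, double frequency) {
        this.permission = permission;
        this.riskIncrease = riskIncrease;
        this.frequency = frequency;
    }

    double ratio() {
        return riskIncrease / frequency; // risk reduction per instance of use
    }
}

class GreedyRelaxation {
    /** Picks active safeguards by ascending ratio while the risk stays within the threshold. */
    static List<ActiveSafeguard> select(List<ActiveSafeguard> active,
                                        double currentRisk, double threshold) {
        // Min-heap on the ratio: least risk reduction per use comes out first.
        PriorityQueue<ActiveSafeguard> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a.ratio(), b.ratio()));
        heap.addAll(active);
        List<ActiveSafeguard> chosen = new ArrayList<>(); // Step 2: empty set
        double risk = currentRisk;
        while (risk < threshold && !heap.isEmpty()) {
            ActiveSafeguard next = heap.poll();           // Step 3: smallest ratio
            if (risk + next.riskIncrease > threshold) {
                break;                                    // Step 5: undo the last pick and stop
            }
            chosen.add(next);
            risk += next.riskIncrease;                    // Step 4: updated risk
        }
        return chosen;                                    // Step 6 would relax these
    }
}
```

With a current risk of 12, a threshold of 20, and active safeguards X (increase 2, frequency 4), Y (increase 6, frequency 2) and Z (increase 3, frequency 3), the sketch relaxes X and Z and stops before Y, since relaxing Y would raise the risk to 23.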

7 Implementation 

Our prototype augments Sun’s Java Runtime Environment (version 1.4.2) running on 
Redhat Linux 9 (with kernel 2.4.20). Security in the Java 2 model is handled by the 
AccessController class, which in turn invokes the legacy SecurityManager. We in- 
strumented the latter to invoke an Initialize class which constructs and initializes an 
instance of the RheoStat class [Gehani03]. A shutdown hook is also registered with 
the SecurityManager so that the intrusion detector is terminated when the user appli- 
cation exits. The ActiveMonitor class initializes the first time the AccessController’s 
checkPermission( ) method is invoked. 

RheoStat implements a limited state transition analysis intrusion detector, based on 
the methodology of [Ilgun95]. It adds a pre-match timer and post-match timer for each 
signature. The first handles false partial matches by resetting the signature if the match 
doesn’t complete in a pre-determined interval. The second is used to reset the system 
after a signature match completes and sufficient time has elapsed to deem that the threat 
has passed. 
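
The role of the two timers can be illustrated with a small signature matcher. This is a sketch under simplifying assumptions and not RheoStat's implementation: `TimedSignature` is a hypothetical class, events are plain strings, timestamps are passed in explicitly in milliseconds, and the signature is assumed to contain at least one event.

```java
/** Sketch of an ordered-event signature with pre-match and post-match timers. */
class TimedSignature {
    private final String[] events;      // ordered events of the signature
    private final long preMatchMillis;  // window in which a partial match must complete
    private final long postMatchMillis; // time after a full match before the threat is reset
    private int matched = 0;            // number of signature events matched so far
    private long firstMatchAt = 0;      // time of the first matched event
    private long completedAt = -1;      // time of the full match, or -1 if incomplete

    TimedSignature(String[] events, long preMatchMillis, long postMatchMillis) {
        this.events = events;
        this.preMatchMillis = preMatchMillis;
        this.postMatchMillis = postMatchMillis;
    }

    /** Feeds one event observed at time now. */
    void onEvent(String event, long now) {
        expire(now);
        if (completedAt >= 0) {
            return; // fully matched: wait for the post-match timer
        }
        if (events[matched].equals(event)) {
            if (matched == 0) {
                firstMatchAt = now;
            }
            matched++;
            if (matched == events.length) {
                completedAt = now;
            }
        }
    }

    /** Applies whichever timer is relevant to the current state. */
    private void expire(long now) {
        if (completedAt >= 0) {
            if (now - completedAt >= postMatchMillis) {
                reset(); // threat deemed to have passed
            }
        } else if (matched > 0 && now - firstMatchAt >= preMatchMillis) {
            reset();     // false partial match
        }
    }

    private void reset() {
        matched = 0;
        completedAt = -1;
    }

    int matchedCount(long now) {
        expire(now);
        return matched;
    }

    boolean isComplete(long now) {
        expire(now);
        return completedAt >= 0;
    }
}
```

The value returned by `matchedCount` is the kind of partial-match information a likelihood function could consume, although the exact function used by the prototype is not shown here.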

Fig. 4. When the Risk Manager needs to activate a response to effect risk reduction, it attempts to 
select the one which will minimize the runtime overhead while maximizing the risk reduction. 



7.1 Risk Manager 

A RiskManager class uses events generated by RheoStat and invokes the ActiveMonitor’s 
enableSafeguard() and disableSafeguard() methods as required. The $\mu$ matching 
function, described in Section 4.1, is used for estimating the threat from a partial match. 
We use: 

( 28 ) 

The new risk level is calculated with the updated threat levels. If the risk level increases 
above or decreases below the risk tolerance threshold, the following course of action 
occurs. 

The RiskManager maintains two heap data structures as illustrated in Figure 4. The 
first one contains all the permissions for which the ActiveMonitor has predicates but 
are currently unused. The objects are stored in the heap using the cost-to-benefit ratios 
as the keys. The second heap contains all the permissions for which the ActiveMonitor 
is currently evaluating predicates before it grants permissions. The objects in this heap 
are keyed by the benefit-to-cost ratios. When the risk level rises, the RiskManager 
extracts the minimum value element from the first heap, and inserts it into the second 
heap. The corresponding predicate evaluation is activated. The risk level is updated. If 
it remains above the risk tolerance threshold the process is repeated until the risk has 
reduced sufficiently. Similarly, when an event causes the risk to drop, the RiskManager 
extracts the minimum element repeatedly from the second heap, inserting it into the first 
heap, disabling predicate checks for the permission, while the risk remains below the 
threshold of tolerance. In this manner, the system is able to adapt its security posture 
continuously. 

8 Evaluation 

The NIST ICAT database [ICAT] contains information on over 6,200 vulnerabilities in 
application and operating system software from a range of sources. These are primarily 
classified into seven categories. Based on the database, we have constructed a suite of 
attacks, with each attack illustrating the exploitation of a vulnerability from a different 
category. In each case, the system component which includes the vulnerability is a Java 
servlet that we have created and installed in the W3C’s Jigsaw web server (version 
2.2.2) [Jigsaw]. While our approach is general, we focus below only on 4 categories for 
brevity. We describe a scenario that corresponds to each attack, including a description 
of the vulnerability that it exploits, the intrusion signature used to detect it and the way 
the system responds. The global risk tolerance threshold is set at 20. 

8.1 Configuration Error 

A configuration error introduces a vulnerability into the system due to settings that are 
controlled by the user. Although the configuration is implemented faithfully by the sys- 
tem, it allows the security policy to be subverted. In our example, the servlet authen- 
ticates the user before granting access to certain documents. The password used is a 
permutation of the username. As a result, an attacker can guess the password after a 
small number of attempts. The flaw here is the weak configuration. 

When the following sequence of events is detected, an attack that exploits this 
vulnerability is deemed to have occurred. First, the web server accepts a connection to 
port 8001. Second, it serves the specific HTML document which includes the form which 
requests authentication information as well as the desired document. Third, the server 
receives another connection. Fourth, the servlet that verifies if the file can be served to 
the client, based on the authentication information provided, will execute. Fifth, the 
decision to deny the request is logged. If this sequence of events repeats twice again within 
the pre-match timeout of the signature, which is 1 minute, an intrusion attempt is deemed to 
have occurred. 

In Figure 5, events 7-18 and 20-22 correspond to this signature. Events 1-6 are 
of other signatures that cause the risk level to rise. Event 18 causes the risk threshold 
to be crossed. As a result, the RiskManager searches for and finds the risk reduction 
measure which has the lowest cost-benefit ratio. The system enables a predicate for 
the permission that controls whether the servlet can be executed. This is event 19 and 
reduces the risk. The predicate checks whether the current time is within the range of 
business operating hours. It allows the permission to be granted only if it evaluates to 
true. During operating hours, it is likely that the intrusion will be flagged and seen by 
an administrator and it is possible that the event sequence occurred accidentally, so the 
permission continues to be granted. Outside those hours, it is likely that this is an attack 
attempt and no administrator is present, so the permission is denied thereafter, till the 
post-match timer expires after one hour and the threat is reset. 

Fig. 5. Attack exploiting a configuration error. (Plot; x-axis: intrusion and response events.) 
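
A predicate of this kind is short. The following sketch is one possible formulation, not the code used in the evaluation: `BusinessHoursPredicate` is a hypothetical class, the opening and closing times are parameters chosen by the caller, and the current time is passed in explicitly so that the check is easy to exercise.

```java
import java.time.LocalTime;
import java.util.function.Predicate;

/** Sketch of a predicate that holds only during business operating hours. */
class BusinessHoursPredicate implements Predicate<LocalTime> {
    private final LocalTime open;  // inclusive start of operating hours
    private final LocalTime close; // exclusive end of operating hours

    BusinessHoursPredicate(LocalTime open, LocalTime close) {
        this.open = open;
        this.close = close;
    }

    @Override
    public boolean test(LocalTime now) {
        // True from the opening time up to, but not including, the closing time.
        return !now.isBefore(open) && now.isBefore(close);
    }
}
```

In a deployment the argument would typically be `LocalTime.now()`. The sketch assumes that the operating hours do not span midnight.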

8.2 Design Error 

A design error is a flaw that introduces a weakness in the system despite a safe
configuration and correct implementation. In our example, the servlet allows a remote
node to upload data to the server. The configuration specifies the maximum size of a
file that can be uploaded. The servlet implementation ensures that each uploaded file is
limited to the size specified in the configuration. However, the design of the restriction
did not account for the fact that the same remote node can perform repeated uploads.
This effectively allows an attacker to launch the very denial-of-service attack (which
results when the disk is filled) that limiting the upload file size was meant to guard against.

When the following sequence of events is detected, an attack that exploits this
vulnerability is deemed to have occurred. First, the web server accepts a connection to
port 8001. Second, it serves the specific HTML document which includes the form that
allows uploads. Third, the server receives another connection. Fourth, it executes the
servlet that accepts the upload and limits its size. Fifth, a file is written to the uploads
directory. If this sequence of events repeats twice more within the pre-match timeout
of the signature, which is 1 minute, an intrusion attempt is deemed to have occurred.
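
The matching rule above (an ordered event sequence that must be seen three times within the pre-match timeout) can be sketched as a small state machine. This is a sketch under stated assumptions, not the detector used in the prototype: the class name, the event labels, and the choice to measure the timeout from the start of the earliest counted run are all hypothetical.

```python
from collections import deque


class SequenceSignature:
    """Matches an ordered event sequence repeated within a time window."""

    def __init__(self, steps, repeats=3, pre_match_timeout=60.0):
        self.steps = list(steps)          # ordered event labels of one run
        self.repeats = repeats            # one run plus two repetitions
        self.timeout = pre_match_timeout  # seconds (1 minute in the text)
        self.position = 0                 # index of the next expected step
        self.run_start = None             # time at which the current run began
        self.starts = deque()             # start times of completed runs

    def observe(self, event, timestamp):
        """Feed one event; return True when the signature fully matches."""
        if event != self.steps[self.position]:
            return False                  # events of other signatures are ignored
        if self.position == 0:
            self.run_start = timestamp
        self.position += 1
        if self.position < len(self.steps):
            return False
        # One full pass through the sequence has completed.
        self.position = 0
        self.starts.append(self.run_start)
        # Discard runs that began outside the pre-match window.
        while self.starts and timestamp - self.starts[0] > self.timeout:
            self.starts.popleft()
        if len(self.starts) >= self.repeats:
            self.starts.clear()
            return True
        return False
```

Feeding the five events of the upload sequence three times in quick succession makes the final `observe` call return `True`; if the events are spread so that no three runs fall inside one minute, no match is reported.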

In Figure 6, events 7-21 correspond to this signature. Events 1-6 are of other
signatures that cause the risk level to rise. Event 21 causes the risk threshold to be
crossed. The system responds by enabling a predicate for the permission that controls
whether files can be written to the uploads directory. This is event 22 and reduces the
risk. The predicate checks whether the current time is within the range of business
operating hours. It allows the permission to be granted only if it evaluates to true.
During operating hours, it is likely that the denial-of-service attempt will be flagged
and seen by an administrator, so the permission continues to be granted on the
assumption that manual response will occur. Outside those hours, it is likely that no
administrator is present, so the permission is denied thereafter, until the post-match
timer expires after one hour and the threat is reset.

Fig. 6. Attack exploiting a design error.

8.3 Environment Error

An environment error is one where an assumption is made about the runtime environment
which does not hold. In our example, the servlet authenticates a user, then stores
the user’s directory in a cookie that is returned to the client. Subsequent responses
utilize the cookie to determine where to serve files from. The flaw here is that the server
assumes the environment of the cookie is safe, which it is not since it is exposed to
manipulation by the client. An attacker can exploit this by altering the cookie’s value to
reflect a directory that they should not have access to.

When the following sequence of events is detected, an attack that exploits this
vulnerability is deemed to have occurred. First, the web server accepts a connection to
port 8001. Second, it serves the specific HTML document which includes the form that
authenticates a user. Third, the server receives another connection. Fourth, it executes
the servlet that authenticates the user and maps users to the directories that they are
allowed to access. It sets a cookie which includes the directory from which files will be
retrieved for further requests. Fifth, the server receives another connection. Sixth, it
serves the specific HTML document that includes the form which accepts the file request.
Seventh, the server receives another connection. Eighth, the servlet that processes the
request, based on the form input as well as the cookie data, is executed. Ninth, a file is
served from a directory that was not supposed to be accessible to the user. The events
must all occur within the pre-match timeout of the signature, which is 1 minute.

In Figure 7, event 3, events 8-14 and event 16 correspond to this signature. Events
1-2 and 4-7 are of other signatures. Event 14 causes the risk threshold to be crossed.
The system responds by enabling a predicate for the permission that controls whether
the file download servlet can be executed. This is event 15 and reduces the risk. The
predicate simply denies the permission. As a result, the attack cannot complete, since
no more files can be downloaded until the safeguard is removed when the risk reduces at
a later point in time (when a threat’s timer expires).

Fig. 7. Attack exploiting an environment error.

8.4 Input Validation Error

An input validation error is one that results from the failure to conduct necessary checks
on the data. A common example of this type of error is the failure to check that the data
passed in is of length no greater than that of the buffer in which it is stored. The result
is a buffer overflow which can be exploited in a variety of ways. In our example, the
servlet allows a file on the server to be updated remotely. The path of the target file is
parsed and a check is performed to verify that it is in a directory that can be updated.
The file ‘Password.cfg’ is used in each directory to describe which users may access
it. By uploading a file named ‘Password.cfg’, an attacker can overwrite and alter the
access configuration of the directory. As a result, they can gain unlimited access to the
other data in the directory.

When the following sequence of events is detected, an attack that exploits this
vulnerability is deemed to have occurred. First, the web server accepts a connection to
port 8001. Second, it serves the specific HTML document which includes the form that
allows uploads to selected directories. Third, the server receives another connection.
Fourth, it executes the servlet that checks that the uploaded file is going to a legal
directory. Fifth, the ‘Passwords.cfg’ file in the uploads directory is written to. The events
must all occur within the pre-match timeout of the signature, which is 1 minute.

In Figure 8, event 1 and events 7-10 correspond to this signature. Events 2-6
are of other signatures. Event 10 causes the risk threshold to be crossed. The system
responds by enabling a predicate for the permission that controls write access to the
‘Passwords.cfg’ file in the uploads directory. This is event 11 and reduces the risk. The
predicate simply denies the permission. As a result, the attack cannot complete, since
the last step requires this permission to upload and overwrite the ‘Passwords.cfg’ file.
Enabling this safeguard does not affect legitimate uploads since they do not need to
write to this file.

Fig. 8. Attack exploiting an input validation error. (Plot; x-axis: intrusion and response events.)
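
The flaw exploited in this scenario, a check on the target directory that ignores the target file name, can be illustrated with a short sketch. It is not the servlet's code and the stricter check is not a remedy proposed in the text (the response described above is to deny the write permission): the directory path, the function names and the stricter variant are assumptions made for the example, and the file name follows the spelling used in the signature description.

```python
import posixpath

UPDATABLE_DIRS = {"/srv/uploads"}  # hypothetical directory open to updates
ACCESS_FILE = "Passwords.cfg"      # access configuration file of a directory


def directory_check_only(target: str) -> bool:
    """The check described in the text: only the directory is validated."""
    return posixpath.dirname(posixpath.normpath(target)) in UPDATABLE_DIRS


def directory_and_name_check(target: str) -> bool:
    """Same check, plus a refusal to overwrite the access configuration."""
    path = posixpath.normpath(target)
    return (posixpath.dirname(path) in UPDATABLE_DIRS
            and posixpath.basename(path) != ACCESS_FILE)
```

Under these assumptions, `directory_check_only` accepts an upload that targets the access configuration file in an updatable directory, which is the step the attack depends on, while the stricter variant rejects it and still accepts ordinary uploads.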



9 Related Work 

We describe below the relationship of our work to previous research on intrusion detec- 
tors and risk management systems. 



9.1 Intrusion Detection 

Early systems developed limited ad-hoc responses, such as limiting access to a user’s 
home directory or logging the user out [Bauer88], or terminating network connections 
[Pooch96]. This has also been the approach of recent commercial systems. For exam- 
ple, BlackICE [BlackICE] allows a network connection to be traced, Intruder Alert 
[IntruderAlert] allows an account to be locked, NetProwler [NetProwler] can update 
firewall rules, NetRanger [Cisco] can reset TCP connections and RealSecure [ISS] can 
terminate user processes. 

Frameworks have been proposed for adding response capabilities. DCA [Fisch96]
introduced a taxonomy for response and a tool to demonstrate the utility of the taxonomy.
EMERALD’s [Porras97] design allows customized responses to be invoked automatically,
but does not define them by default. AAIR [Carver01] describes an expert
system for response based on an extended taxonomy.

Our approach creates a framework for systematically choosing a response in real- 
time, based on the goal of reducing exposure by reconfiguring the access control sub- 
system. This allows an attack to be contained automatically instead of being limited to 
raising an alarm, and does not require a new response subsystem to be developed for 
each new class of attack discovered. 

9.2 Risk Management

Risk analysis has been utilized to manage the security of systems for several decades 
[FIPS31]. However, its use has been limited to offline risk computation and manual
response. [SooHoo02] proposes a general model using decision analysis to estimate
computer security risk and automatically update input estimates. [Bilar03] uses reliability
modeling to analyze the risk of a distributed system. Risk is calculated as a function 
of the probability of faults being present in the system’s constituent components. Risk 
management is framed as an integer linear programming problem, aiming to find an 
alternate system configuration, subject to constraints such as acceptable risk level and 
maximum cost for reconfiguration. 

In contrast to previous approaches, we use the risk computation to drive changes in 
the operating system’s security mechanisms. This allows risk management to occur in 
real-time and reduces the window of exposure. 

10 Future Directions 

We utilized a simple μ function that assumed independent probabilities for successive
events. However, μ functions can be defined even when pre-conditions are known. By
measuring the frequencies of successive events occurring in typical and attacked work- 
loads, conditional probabilities can be derived. A tool to automate the process could be 
constructed. 
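
As a rough illustration of the kind of tool meant here, conditional probabilities of successive events can be estimated from an event trace by counting adjacent pairs. This is a sketch only: the function name, the trace format (a list of event labels) and the first-order, previous-event-only conditioning are assumptions, not part of the prototype.

```python
from collections import Counter, defaultdict


def conditional_probabilities(trace):
    """Estimate P(next event | previous event) from one event trace."""
    pair_counts = Counter(zip(trace, trace[1:]))  # adjacent (prev, next) pairs
    prev_counts = Counter(trace[:-1])             # times each event has a successor
    probs = defaultdict(dict)
    for (prev, nxt), n in pair_counts.items():
        probs[prev][nxt] = n / prev_counts[prev]
    return dict(probs)
```

Running the function separately on a typical workload trace and on a trace recorded under attack would give two tables whose differences indicate which event successions are characteristic of the attack.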

The exposure reduction values, workload frequencies, consequence costs and risk 
threshold were all manually calculated in our prototype. All such parameters will need 
to be automatically derived for our approach to be practical. The frequencies with which 
permissions are utilized can be estimated by instrumenting the system to measure these 
with a typical workload. 

A similar approach could be used to determine the average inherent risk of a work- 
load. An alternative would be the creation of a tool to visualize the effect of varying 
the risk threshold on (i) the performance of the system and (ii) the cost of intrusions 
that could successfully occur below the risk threshold. Policy would then dictate the 
trade-off point chosen. 

The problem of labeling data with associated consequence values can be addressed
with a suitable user interface augmentation. For example, it could utilize user input
when new files are being created by application software. The issue could also be
partially mitigated by using pre-configured values for all system files.

Finally, some attacks may utilize few or no permission checks. Such scenarios fall 
into two classes. In the first case, this points to a design shortcoming where new per- 
missions need to be introduced to guard certain resources such as critical subroutines 
in system code. The other case is when the attack has a very small footprint, in which 
case our approach will fail (as it cannot recognize the threat in advance).

11 Conclusion 

We have introduced a formal framework for managing the risk posed to a host. The 
model calculates the risk based on the threats, exposure to the threats and consequences 
of the threats. The threat likelihoods are estimated in real-time using output from an
intrusion detector. The risk is managed by altering the exposure of the system. This
is done by dynamically reconfiguring the modified access control subsystem. The utility
of the approach is illustrated with a set of attack scenarios in which the risk is managed
in real-time and results in the attacks being contained. Automated configuration of the
system’s parameters, either analytically or empirically, remains an open research area.